Introduction

In a previous study, TNO Defensie en Veiligheid, location The Hague, developed algorithms
on behalf of the Ministry of Defence that can detect targets with a specific magnetic
signature. These algorithms determine a number of parameters that are characteristic of
the signature in question. In a practical implementation of the algorithms an important
question arises: ‘How many parameters do we need to take into account?’ This so-called
model selection problem, an essential part of the algorithms, was not addressed in the
previous study.

Model selection stands out as one of the most important problems in the field of
statistical inference. It is an activity aimed at learning rules and constraints from
measured data. Ultimately, a hypothesis is either accepted or rejected. An example of
such a hypothesis is: ‘the number of parameters for vehicle A is 10.’

Model selection

Statistical modelling is concerned with finding general rules from observed data. In
short, this comes down to extracting information from the available data. In modelling
we should admit no more than is strictly necessary; that is, ‘if we have the choice
between indifferent alternatives, we should choose the simplest.’

The goal of model selection is to search for regularities in the data.

‘Regularity’ can be identified with ‘the ability to compress’. Combining these two
notions should enable us, given a set of hypotheses and data, to find the hypothesis
that compresses the data most. To this end we describe a principle called ‘minimum
description length’, which is based on searching for regularity in data. This principle
strikes a compromise between the goodness of fit to the data and the complexity of the
model.

TNO-DV1 2004 A234
Reliable inferences

With reliable inferences we can make good predictions and decisions regarding the data.
They allow us to determine when we can use simpler models and when we cannot. In
general, we are therefore interested in what can and what cannot be reliably predicted
with a partially correct model.


In the report we describe a new procedure called ‘entropification’. With an
‘entropified’ model we can, provided we have enough data, find the model with the
smallest error. It also gives a correct estimate of the average error. It therefore
gives a good impression of ‘how good it really is’.

Application

With the techniques described it is possible to recognize vehicle signatures in measured
data with high certainty. Carrying out a large number of signature measurements yields
a-priori knowledge, which makes inference even more reliable.

Finally, we note that the theory is generally applicable and not only to magnetic
vehicle signatures.

1.  Introduction


Signal processing is concerned with the representation, manipulation, and transformation
of signals and the information that they carry. For example, we may wish to enhance a
signal by reducing the noise or some other interference. We may also want to classify an
object by means of its signal content. To classify, we need an appropriate model, and an
essential constituent of that model is its length. The correct choice of the model
length improves the use of the model in classification. Once it is possible to model a
signal accurately, it becomes possible to perform important signal processing tasks.

Conditioning on a single model ignores model uncertainty, and thus leads to the
underestimation of uncertainty when making inferences about quantities of interest. A
complete Bayesian solution to this problem involves averaging over all possible models
when making such inferences. This approach is often not practical. An alternative
approach involves averaging over a reduced set of models. Further, one can directly
approximate the complete solution with a Markov chain Monte Carlo method, which
approximates the posterior distribution of a quantity of interest by generating a
process that moves through model space. This is discussed in chapter 5.

In order to assess the predictive ability of the selected models for future observations
we must develop a measure of the effectiveness of a model selection strategy. A possible
approach is to compare the quality of the predictions based on model averaging with the
quality of predictions based on any single model. The choice of which procedure to use
will depend on the particular application.

Reliable inferences allow one to make good predictions and decisions regarding the data
under a much wider variety of assumptions than unreliable inferences do. They allow us
to establish in what way we can and in what way we cannot use overly simple models. In
general, we will be interested in what can be reliably predicted - and what not - from a
model that is only partially correct.

With an entropified model, if given enough data, we can find the model with the smallest
expected prediction error. This model will provide a correct estimate of the average
prediction error that it will achieve; hence the model gives a good impression of ‘how
good it really is’ when errors are measured. Entropification is covered in chapter 6.

A detailed discussion of how the problematic issues are resolved is presented in the
epilogue. We start with an introduction that makes intuitive the theory dealing with the
quantity of information in individual objects, followed by a non-technical introduction
to the MDL principle. We conclude the introduction by making precise what the
non-technical treatment left informal.




2.  Kolmogorov  complexity  and  information  theory 


How  should  we  measure  the  amount  of  information  about  a  phenomenon  that  is  given 
to  us  by  a  particular  observation  concerning  the  phenomenon? 

Shannon information theory, usually called just ‘information’ theory, was introduced in
1948 by C.E. Shannon (1916-2001). Kolmogorov complexity theory is also known as
‘algorithmic information’ theory. It was introduced independently and with different
motivations by R.J. Solomonoff (born 1926), A.N. Kolmogorov (1903-1987) and G. Chaitin
(born 1943) in 1960/1964, 1965 and 1966 respectively. Both theories aim at providing a
means for measuring ‘information’. They use the same unit to do this: the bit. In both
cases, the amount of information in an object may be interpreted as the length of a
description of the object. In the Shannon approach, however, the method of encoding
objects is based on the presupposition that the objects to be encoded are outcomes of a
known random source - it is only the characteristics of that random source that
determine the encoding, not the characteristics of the objects that are its outcomes. In
the Kolmogorov complexity approach we consider the individual objects themselves, in
isolation so to speak, and the encoding of an object is a computer program (Turing
machine) that generates it and then halts. In the Shannon approach we are interested in
the minimum expected number of bits to transmit a message from a random source of known
characteristics through an error-free channel. In Kolmogorov complexity we are
interested in the minimum number of bits from which a particular message can effectively
be reconstructed. A little reflection reveals that this is a great difference: for every
source emitting but two messages the Shannon information is at most 1 bit, but we can
choose both messages concerned of arbitrarily high Kolmogorov complexity. Shannon
stresses in his founding article that his notion is only concerned with communication,
while Kolmogorov stresses in his founding article that his notion aims at filling the
gap left by Shannon theory concerning the information in individual objects. To be sure,
both notions are natural: Shannon ignores the object itself and considers only the
characteristics of the random source of which the object is one of the possible
outcomes, while Kolmogorov considers only the object itself to determine the number of
bits in the ultimate compressed version, irrespective of the manner in which the object
arose. Furthermore, we note that Shannon’s approach is based on probability
distributions, whereas the approach of Kolmogorov dispenses with this notion.

In this chapter, we introduce, compare and contrast the Shannon and Kolmogorov
approaches. We do this by switching back and forth between the two theories, according
to the following pattern: we first discuss a concept of Shannon’s theory, together with
its properties and some questions it leaves open. We then provide Kolmogorov’s analogue
of the concept and show how it answers the questions left open by Shannon’s theory. We
use as our guiding motif the communication between a sender A and a receiver B; where
appropriate, we also discuss the related setting of a question-answer session between B
and A.

To obtain an understanding of the two theories and how they relate, it is crucial to
read the overview below and then section 2.2 and section 2.3, which discuss
preliminaries, fix notation and introduce the basic notions. The other sections are
written so that they can be read separately from one another. Throughout the chapter, we
assume some basic familiarity with elementary notions of probability theory and
computation.

The chapter does not contain any new results. All theorems that we present here, as well
as further details, context and discussion, can be found in either of two standard
textbooks: (Cover and Thomas, 1991, [20]), the standard reference on Shannon information
theory, and/or (Li and Vitányi, 1997, [71]), the standard reference on Kolmogorov
complexity.


2.1  Overview  and  summary 

A summary of the basic ideas is given below. In the chapter, these notions are discussed
in the same order.

1.  Coding: Prefix codes, Kraft inequality. Since descriptions or encodings of objects
are fundamental to both theories, we first review some elementary facts about coding.
The most important of these is the Kraft inequality. This inequality gives the
fundamental relationship between probability mass functions and prefix codes, which are
the type of codes we are interested in (section 2.2).

2.  Shannon’s Fundamental Concept: Entropy is defined as a functional that maps
probability distributions or, equivalently, random variables, to real numbers. This
notion is derived from first principles as the only ‘reasonable’ way to measure the
‘average amount of information conveyed when an outcome of the random variable is
observed’. The notion is then related to encoding and communicating messages by
Shannon’s famous ‘coding theorem’ (section 2.3.1).

3.  Kolmogorov’s  Fundamental  Concept:  Kolmogorov  Complexity  is  defined  as  a 
function  that  maps  objects  (to  be  thought  of  as  natural  numbers  or  sequences 
of  symbols)  to  the  natural  numbers.  Intuitively,  the  Kolmogorov  complexity  of 
a  sequence  is  the  length  (in  bits)  of  the  shortest  computer  program  that  prints 
the  sequence  and  then  halts  (Section  2.3.2). 

4.  Universal  Coding:  interpolating  between  Shannon  and  Kolmogorov.  Although 
their  primary  aim  is  quite  different,  and  they  are  functions  defined  on  different 
spaces,  there  are  close  relations  between  entropy  and  Kolmogorov  complexity. 
These  are  best  illustrated  by  explaining  ‘universal  coding’  which  combines 
elements  from  both  Shannon’s  and  Kolmogorov’s  theory,  and  which  lies  at  the 
basis  of  most  practical  data  compression  methods  (Section  2.4). 

Entropy and Kolmogorov Complexity are the basic notions of the two theories. They serve
as building blocks for all other important notions in the respective theories. Arguably
the most important of these notions is mutual information:

5.  Mutual Information for Shannon and Kolmogorov: Entropy and Kolmogorov complexity are
concerned with information in a single object: a random variable (Shannon) or an
individual sequence (Kolmogorov). Both theories provide a (distinct) notion of mutual
information that measures the information that one object gives about another object. In
Shannon’s theory, this is the information that one random variable carries about
another; in Kolmogorov’s theory (‘algorithmic mutual information’), it is the
information a sequence gives about another one (Section 2.5).

Entropy, Kolmogorov complexity and mutual information are concerned with lossless
description or compression: messages must be described in such a way that from the
description, the original message can be completely reconstructed. Extending the
theories to lossy description or compression enables the formalization of more
sophisticated concepts, such as ‘meaningful information’ and ‘useful information’.
‘Meaningful information’ is defined in the Kolmogorov framework using the Kolmogorov
structure function. ‘Useful information’ is defined in Shannon’s framework using the
rate-distortion function. We end the chapter with a brief treatment of the latter:

6.  Useful  Information:  Rate-distortion  theory  is  the  part  of  Shannon  information 
theory  that  deals  with  the  following  situation:  The  sender  is  only  allowed  to 
use  a  fixed  (small)  number  of  bits  to  send  his  message.  The  goal  is  then  to  send 
the  most  useful  or  valuable  message  given  this  constraint  (Section  2.6). 

2.2  The  coding  framework  and  the  Kraft  inequality 
Notational  preliminaries: 

1.  Strings  Let $\mathcal{X}$ be some finite or countable set. We use the notation
$\mathcal{X}^*$ to denote the set of finite strings or sequences over $\mathcal{X}$. For
example,

$$\{0,1\}^* = \{\epsilon, 0, 1, 00, 01, 10, 11, 000, \ldots\},$$

with $\epsilon$ denoting the empty word ‘’ with no letters. Let $x, y, z \in
\mathcal{N}$, where $\mathcal{N}$ denotes the natural numbers. We identify $\mathcal{N}$
and $\{0,1\}^*$ according to the correspondence

$$(0, \epsilon),\ (1, 0),\ (2, 1),\ (3, 00),\ (4, 01),\ \ldots \qquad (2.1)$$

The length $l(x)$ of $x$ is the number of bits in the binary string $x$. For example,
$l(010) = 3$ and $l(\epsilon) = 0$. If $x$ is interpreted as an integer, it can be shown
that $l(x) = \lfloor \log(x + 1) \rfloor$ and, for $x \ge 2$,

$$\lfloor \log x \rfloor \le l(x) \le \lceil \log x \rceil. \qquad (2.2)$$

Here, as in the sequel, $\lceil x \rceil$ is the smallest integer larger than or equal
to $x$, $\lfloor x \rfloor$ is the largest integer smaller than or equal to $x$ and
$\log$ denotes logarithm to base two. We shall typically be concerned with encoding
finite-length binary strings by other finite-length binary strings. The emphasis is on
binary strings only for convenience; observations in any alphabet can be so encoded in a
way that is ‘theory neutral’.

2.  (In)equality up to a constant  We will denote by $\stackrel{+}{<}$ an inequality to
within an additive constant. More precisely, let $f, g$ be functions from $\{0,1\}^*$ to
$\mathbb{R}$. Then by ‘$f(x) \stackrel{+}{<} g(x)$’ we mean that there exists a $c$ such
that for all $x \in \{0,1\}^*$, $f(x) < g(x) + c$. We denote by $\stackrel{+}{=}$ the
situation when both $\stackrel{+}{<}$ and $\stackrel{+}{>}$ hold.

3.  Probabilities  Let $P$ be a probability distribution defined on a finite or
countable set $\mathcal{X}$. Throughout this chapter, we denote by $X$ the random
variable that takes values in $\mathcal{X}$; thus $P(X = x) = P(\{x\})$ is the
probability that the event $\{x\}$ obtains. We write both $P(x)$ and $p_x$ as an
abbreviation of $P(X = x)$.
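
The identification of natural numbers with binary strings in (2.1) can be made concrete
with a short sketch. The function names below are illustrative assumptions, not notation
from the report; the construction relies on the observation that writing $x + 1$ in
binary and dropping the leading 1 yields the string paired with $x$.

```python
def nat_to_string(n: int) -> str:
    """Return the binary string paired with the natural number n in (2.1)."""
    if n < 0:
        raise ValueError("n must be a natural number")
    # bin(n + 1) looks like '0b1...'; dropping '0b1' leaves the string for n.
    return bin(n + 1)[3:]


def string_to_nat(s: str) -> int:
    """Inverse of nat_to_string: the natural number paired with the string s."""
    return int("1" + s, 2) - 1


def length(n: int) -> int:
    """l(x): the number of bits in the string identified with n."""
    return len(nat_to_string(n))


if __name__ == "__main__":
    for n in range(5):
        print(n, repr(nat_to_string(n)), length(n))
```

Under this pairing the length of the string for $n$ equals $\lfloor \log(n + 1) \rfloor$,
which the sketch lets one check numerically for small $n$.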
Codes  We repeatedly consider the following scenario: a sender (say, A) wants to
communicate or transmit some information to a receiver (say, B). The information to be
transmitted is an element from some set $\mathcal{X}$. It will be communicated by
sending a binary string, called the message. When B receives the message, he can decode
it again and (hopefully) reconstruct the element of $\mathcal{X}$ that was sent. To
achieve this, A and B need to agree on a code or description method before
communicating. Intuitively, this is a binary relation between source words and
associated code words. The relation is fully characterized by the decoding function.
Such a decoding function $D$ can be any function $D : \{0,1\}^* \to \mathcal{X}$. The
domain of $D$ is the set of code words and the range of $D$ is the set of source words.
$D(y) = x$ is interpreted as ‘$y$ is a code word for the source word $x$’. The set of
all code words for source word $x$ is the set $D^{-1}(x) = \{y : D(y) = x\}$. Hence,
$E = D^{-1}$ can be called the encoding substitution ($E$ is not necessarily a
function). With each code $D$ we can associate a length function
$L_D : \mathcal{X} \to \mathcal{N}$ such that, for each source word $x$, $L_D(x)$ is the
length of the shortest encoding of $x$:

$$L_D(x) = \min\{l(y) : D(y) = x\}.$$

We denote by $x^*$ the shortest $y$ such that $D(y) = x$; if there is more than one such
$y$, then $x^*$ is defined to be the first such $y$ in lexicographical order.
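
For a finite decoding table, the length function and the shortest code words can be
computed directly. The following is a minimal sketch; the dictionary representation of
$D$ and the toy code are assumptions made for illustration only.

```python
def length_function(decode: dict) -> dict:
    """L_D(x) = min{ l(y) : D(y) = x } for a finite decoding table D (code word -> source word)."""
    lengths = {}
    for code_word, source_word in decode.items():
        if source_word not in lengths or len(code_word) < lengths[source_word]:
            lengths[source_word] = len(code_word)
    return lengths


def shortest_code_words(decode: dict) -> dict:
    """x*: the shortest code word for x, ties broken in lexicographical order."""
    best = {}
    for y, x in decode.items():
        if x not in best or (len(y), y) < (len(best[x]), best[x]):
            best[x] = y
    return best


if __name__ == "__main__":
    # Hypothetical toy code: source word 'a' has two code words, 'b' has one.
    D = {"0": "a", "10": "b", "110": "a"}
    print(length_function(D))
    print(shortest_code_words(D))
```

The table maps code words to source words, so several code words may share one source
word, which is why $E = D^{-1}$ need not be a function.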

In coding theory attention is often restricted to the case where the source word set is
finite, say $\mathcal{X} = \{1, 2, \ldots, N\}$. If there is a constant $l_0$ such that
$l(y) = l_0$ for all code words $y$ (equivalently, $L(x) = l_0$ for all source words
$x$), then we call $D$ a fixed-length code. It is easy to see that $l_0 \ge \log N$. For
instance, in teletype transmissions the source has an alphabet of $N = 32$ letters,
consisting of the 26 letters in the Latin alphabet plus 6 special characters. Hence, we
need $l_0 = 5$ binary digits per source letter. In electronic computers we often use the
fixed-length ASCII code with $l_0 = 8$.
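
The bound $l_0 \ge \log N$ can be checked with a small sketch. The function names are
illustrative assumptions; the code simply numbers the source words and writes each
number with the same count of binary digits.

```python
def fixed_length_bits(num_source_words: int) -> int:
    """Smallest l0 with 2**l0 >= N, i.e. the least integer l0 >= log N."""
    if num_source_words < 1:
        raise ValueError("need at least one source word")
    return (num_source_words - 1).bit_length()


def fixed_length_code(num_source_words: int) -> list:
    """Code words for source words 1..N, all of the same length l0 (N >= 2 assumed)."""
    if num_source_words < 2:
        raise ValueError("this sketch assumes at least two source words")
    l0 = fixed_length_bits(num_source_words)
    return [format(i, "b").zfill(l0) for i in range(num_source_words)]


if __name__ == "__main__":
    print(fixed_length_bits(32))      # teletype alphabet of 32 letters
    print(fixed_length_code(4))
```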

Prefix code  It is immediately clear that in general we cannot uniquely recover $x$ and
$y$ from $E(xy)$. Let $E$ be the identity mapping. Then we have
$E(00)E(00) = 0000 = E(0)E(000)$. We now introduce prefix codes, which do not suffer
from this defect. A binary string $x$ is a proper prefix of a binary string $y$ if we
can write $y = xz$ for $z \ne \epsilon$. A set $\{x, y, \ldots\} \subseteq \{0,1\}^*$ is
prefix-free if for any pair of distinct elements in the set neither is a proper prefix
of the other. A function $D : \{0,1\}^* \to \mathcal{N}$ defines a prefix-code if its
domain is prefix-free. In order to decode a code sequence of a prefix-code, we simply
start at the beginning and decode one code word at a time. When we come to the end of a
code word, we know it is the end, since no code word is the prefix of any other code
word in a prefix-code. Suppose we encode each binary string $x = x_1 x_2 \cdots x_n$ as

$$\bar{x} = \underbrace{11 \cdots 1}_{n \text{ times}}\, 0\, x_1 x_2 \cdots x_n.$$

The resulting code is prefix because we can determine where the code word $\bar{x}$ ends
by reading it from left to right without backing up. Note $l(\bar{x}) = 2n + 1$; thus,
we have encoded strings in $\{0,1\}^*$ in a prefix manner at the price of doubling their
length. We can get a much more efficient code by applying the construction above to the
length $l(x)$ of $x$ rather than $x$ itself: define $x' = \overline{l(x)}\, x$, where
$l(x)$ is interpreted as a binary string according to the correspondence (2.1). Then the
code $D'$ with $D'(x') = x$ is a prefix code satisfying, for all $x \in \{0,1\}^*$,
$l(x') = n + 2 \log n + 1$ (here we ignore the ‘rounding error’ in (2.2)). $D'$ is used
throughout this chapter as a standard code to encode natural numbers in a prefix-free
manner; we call it the standard prefix-code for the natural numbers. We use
$L_{\mathcal{N}}(x)$ as notation for $l(x')$. When $x$ is interpreted as a number (using
the correspondence (2.1) and (2.2)), we see that
$L_{\mathcal{N}}(x) = \log x + 2 \log \log x + 1$.
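
The two self-delimiting encodings can be sketched as follows. The function names `bar`,
`prime` and `decode_prime` are illustrative assumptions; the helper that maps numbers to
strings follows the correspondence (2.1).

```python
def nat_to_string(n: int) -> str:
    """Binary string paired with n in (2.1): n + 1 in binary without its leading 1."""
    return bin(n + 1)[3:]


def string_to_nat(s: str) -> int:
    """Inverse of nat_to_string."""
    return int("1" + s, 2) - 1


def bar(x: str) -> str:
    """x-bar: n ones, a zero, then the n bits of x."""
    return "1" * len(x) + "0" + x


def prime(x: str) -> str:
    """x': the bar-encoding of l(x), written as a string via (2.1), followed by x."""
    return bar(nat_to_string(len(x))) + x


def decode_prime(stream: str):
    """Read one x' code word from the front of stream; return (x, remaining bits)."""
    k = stream.index("0")                    # k leading ones: the length field has k bits
    n = string_to_nat(stream[k + 1 : 2 * k + 1])
    start = 2 * k + 1
    return stream[start : start + n], stream[start + n :]


if __name__ == "__main__":
    stream = prime("1") + prime("0110")      # two code words sent back to back
    first, rest = decode_prime(stream)
    second, rest = decode_prime(rest)
    print(first, second, repr(rest))
```

Because each code word announces its own length, the receiver can split a concatenation
of code words without separators, which is the property the identity code lacks.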

Prefix codes and the Kraft inequality  Let $\mathcal{X}$ be the set of natural numbers
and consider the straightforward non-prefix representation (2.1). There are two elements
of $\mathcal{X}$ with a description of length 1, four with a description of length 2 and
so on. However, for a prefix code $D$ for the natural numbers there are fewer binary
prefix code words of each length: if $x$ is a prefix code word then no $y = xz$ with
$z \ne \epsilon$ is a prefix code word. Asymptotically there are fewer prefix code words
of length $n$ than the $2^n$ source words of length $n$. Quantification of this
intuition for countable $\mathcal{X}$ and arbitrary prefix-codes leads to a precise
constraint on the number of code words of given lengths. This important relation is
known as the Kraft Inequality and is due to L.G. Kraft [20].

Theorem 2.1 (Kraft inequality)

Let $l_1, l_2, \ldots$ be a finite or infinite sequence of natural numbers. There is a
prefix-code with this sequence as lengths of its binary code words if and only if

$$\sum_n 2^{-l_n} \le 1.$$
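
For a finite list of lengths, the constructive direction of the theorem can be sketched
as below. This is one standard way to build such a code (assigning code words in order
of increasing length), not necessarily the construction used in the cited references;
all function names are illustrative assumptions, and lengths are assumed to be at least 1.

```python
from fractions import Fraction


def kraft_sum(lengths) -> Fraction:
    """Exact value of the sum of 2**(-l) over the given code word lengths."""
    return sum((Fraction(1, 2 ** l) for l in lengths), Fraction(0))


def prefix_code_from_lengths(lengths) -> list:
    """Return binary code words with the requested lengths, or raise if Kraft fails."""
    if any(l < 1 for l in lengths):
        raise ValueError("this sketch assumes lengths of at least 1")
    if kraft_sum(lengths) > 1:
        raise ValueError("lengths violate the Kraft inequality")
    code_words = [None] * len(lengths)
    next_value, prev_len = 0, 0
    # Process lengths in increasing order; each code word is the next unused
    # binary number at its length, so no earlier word can be a prefix of it.
    for i in sorted(range(len(lengths)), key=lambda i: lengths[i]):
        next_value <<= lengths[i] - prev_len
        code_words[i] = format(next_value, "b").zfill(lengths[i])
        next_value += 1
        prev_len = lengths[i]
    return code_words


def is_prefix_free(words) -> bool:
    """True if no word is a proper prefix of another word in the collection."""
    return not any(a != b and b.startswith(a) for a in words for b in words)


if __name__ == "__main__":
    print(kraft_sum([1, 2, 3, 3]), prefix_code_from_lengths([1, 2, 3, 3]))
```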


Uniquely decodable codes  We want to code elements of $\mathcal{X}$ in a way that they
can be uniquely reconstructed from the encoding. Such codes are called ‘uniquely
decodable’. Every prefix-code is a uniquely decodable code. For example, let
$\mathcal{X} = \{1, 2, 3, 4\}$. If $E(1) = 0$, $E(2) = 10$, $E(3) = 110$, $E(4) = 111$
then 1421 is encoded as 0111100, which can be easily decoded from left to right in a
unique way.
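
The left-to-right decoding of this example can be sketched as follows. The function
names are illustrative assumptions; the code table is the one given in the text.

```python
def encode(symbols, code: dict) -> str:
    """Concatenate the code words of a sequence of source symbols."""
    return "".join(code[s] for s in symbols)


def decode(bits: str, code: dict) -> list:
    """Decode a bit string one code word at a time, reading left to right."""
    inverse = {word: symbol for symbol, word in code.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:        # end of a code word recognized immediately
            decoded.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a code word")
    return decoded


if __name__ == "__main__":
    E = {1: "0", 2: "10", 3: "110", 4: "111"}
    message = encode([1, 4, 2, 1], E)
    print(message, decode(message, E))
```

The decoder never looks ahead: as soon as the bits read so far match a code word, that
word is emitted, which works only because the code word set is prefix-free.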

On  the  other  hand,  not  every  uniquely  decodable  code  satisfies  the  prefix  condition. 
Prefix-codes  are  distinguished  from  other  uniquely  decodable  codes  by  the  property 
that  the  end  of  a  code  word  is  always  recognizable  as  such.  This  means  that  decoding 
can  be  accomplished  without  the  delay  of  observing  subsequent  code  words,  which  is 
why  prefix-codes  are  also  called  instantaneous  codes. 

There  is  a  good  reason  for  our  emphasis  on  prefix-codes.  Namely,  it  turns  out  that 
theorem  2.1  stays  valid  if  we  replace  ‘prefix-code’  by  ‘uniquely  decodable  code’.  This 
important  fact  means  that  every  uniquely  decodable  code  can  be  replaced  by  a  prefix- 
code  without  changing  the  set  of  code-word  lengths.  In  Shannon’s  and  Kolmogorov’s 
theories,  we  are  only  interested  in  code  word  lengths  of  uniquely  decodable  codes 
rather  than  the  actual  encoding.  By  the  previous  argument,  we  may  restrict  the  set  of 
codes  we  work  with  to  prefix  codes,  which  are  much  easier  to  handle. 

Probability distributions and complete prefix codes  A uniquely decodable code is
complete if the addition of any new code word to its code word set results in a
non-uniquely decodable code. It is easy to see that a code is complete if and only if
equality holds in the associated Kraft Inequality. Let $l_1, l_2, \ldots$ be the code
word lengths of some complete uniquely decodable code. Let us define $q_x = 2^{-l_x}$.
By definition of completeness, we have $\sum_x q_x = 1$. Thus, the $q_x$ can be thought
of as a probability mass function corresponding to some probability distribution $Q$. We
say $Q$ is the distribution corresponding to $l_1, l_2, \ldots$. In this way, each
complete uniquely decodable code is mapped to a unique probability distribution. Of
course, this is nothing more than a formal correspondence: we may choose to encode
outcomes of $X$ using a code corresponding to a distribution $Q$, whereas the outcomes
are actually distributed according to some $P \ne Q$. But, as we show in theorem 2.3
below, if $X$ is distributed according to $P$, then the code to which $P$ corresponds
is, in an average sense, the code that achieves optimal compression of $X$.
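
The correspondence between code word lengths and probabilities can be illustrated
numerically. The sketch below uses the lengths 1, 2, 3, 3 as an assumed example of a
complete code; it only illustrates the correspondence and the effect of a mismatch
between $P$ and $Q$, and is not a proof of the optimality claim.

```python
from fractions import Fraction


def distribution_from_lengths(lengths: dict) -> dict:
    """q_x = 2**(-l_x) for each source word x."""
    return {x: Fraction(1, 2 ** l) for x, l in lengths.items()}


def expected_length(p: dict, lengths: dict) -> Fraction:
    """Average code word length when outcomes are distributed according to p."""
    return sum((p[x] * lengths[x] for x in p), Fraction(0))


if __name__ == "__main__":
    lengths = {1: 1, 2: 2, 3: 3, 4: 3}           # assumed example of a complete code
    q = distribution_from_lengths(lengths)
    uniform = {x: Fraction(1, 4) for x in lengths}
    print(sum(q.values()))                       # 1 for a complete code
    print(expected_length(q, lengths))           # outcomes distributed according to Q
    print(expected_length(uniform, lengths))     # outcomes actually uniform
```

When the outcomes are uniform, this code averages more bits per outcome than the
fixed-length code with two bits, which illustrates why the code should match the
distribution that actually generates the data.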

Prefix codes as protocols for asking questions  Prefix codes can be thought of as
protocols for sequentially asking yes/no-questions. To make this precise we slightly
change our setting. We now think of the ‘receiver’ B as someone who sequentially asks
questions about $X$. We assume that the ‘sender’ A only passes on information when asked
a question. But in that case, he answers truthfully. The questions of B must all be of
the form ‘Is the realized value $x$ an element of the set $\mathcal{X}'$?’, where
$\mathcal{X}'$ is some subset of $\mathcal{X}$. B keeps asking such questions until he
has determined the precise value $X = x$. More precisely, B determines a sequence of
sets $\mathcal{X}_\epsilon, \mathcal{X}_0, \mathcal{X}_1, \mathcal{X}_{00},
\mathcal{X}_{01}, \ldots$, satisfying the following two conditions:

1.  $\mathcal{X}_\epsilon = \mathcal{X}$.

2.  Let $y \in \{0,1\}^*$. If $\mathcal{X}_y$ has more than one element, then
$\mathcal{X}_{y0} \cap \mathcal{X}_{y1} = \emptyset$ and
$\mathcal{X}_{y0} \cup \mathcal{X}_{y1} = \mathcal{X}_y$. If $\mathcal{X}_y$ has just
one element, then $\mathcal{X}_{yz}$ is undefined for any continuation $z$ of $y$.

The sets $\mathcal{X}_y$ determine B’s protocol as follows. First, B asks ‘Is
$x \in \mathcal{X}_0$?’. If the answer is yes, then B’s next question is ‘Is
$x \in \mathcal{X}_{00}$?’ If the answer is no, then B knows that $x \in \mathcal{X}_1$
and B’s next question is ‘Is $x \in \mathcal{X}_{10}$?’ If the answer to the first two
questions is yes, B’s third question is ‘Is $x \in \mathcal{X}_{000}$?’ If the answer to
the first question is no and to the second yes, then B’s question is ‘Is
$x \in \mathcal{X}_{100}$?’, and so on. B keeps asking questions in this way until it
has precisely determined the value of $x$, i.e. until it knows that
$x \in \mathcal{X}_y$ for some $y$ such that $\mathcal{X}_y$ has but one element.

To relate such a sequential protocol to prefix codes, consider the code $E$ defined as
follows: for all $x \in \mathcal{X}$, we set $E(x) := y$ for the $y$ such that
$\mathcal{X}_y = \{x\}$. In this way all $x \in \mathcal{X}$ are assigned a unique code
word $E(x)$ such that the set of code words is prefix-free. Therefore, $E$ defines a
prefix-code that, for each source word, reserves exactly one code word. Conversely, one
can show that each prefix-code that reserves only one code word for each source word
coincides with a sequential question-protocol.

Thus, the problems of prefix-free encoding the value of $X$ and sequentially determining
(by asking) the value of $X$ are really equivalent. This is yet another reason why
prefix codes are more ‘natural’ than general uniquely decodable codes.
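
The equivalence can be sketched for a finite set of distinct source words. Splitting
each set into two halves is an assumption made here for concreteness; any split that
satisfies the two conditions above would do. The function names are illustrative.

```python
def build_protocol(elements) -> dict:
    """Return {y: X_y}, starting from X_epsilon = all elements and halving each set."""
    sets = {}

    def split(y: str, members: list) -> None:
        sets[y] = members
        if len(members) > 1:
            mid = len(members) // 2
            split(y + "0", members[:mid])    # X_{y0}
            split(y + "1", members[mid:])    # X_{y1}, disjoint from X_{y0}

    split("", list(elements))
    return sets


def code_from_protocol(sets: dict) -> dict:
    """E(x) := y for the y such that X_y = {x}."""
    return {members[0]: y for y, members in sets.items() if len(members) == 1}


def ask(sets: dict, x) -> str:
    """Simulate B's questions; a 'yes' is recorded as 0 and a 'no' as 1."""
    y = ""
    while len(sets[y]) > 1:
        y += "0" if x in sets[y + "0"] else "1"
    return y


if __name__ == "__main__":
    protocol = build_protocol([1, 2, 3, 4])
    print(code_from_protocol(protocol))
    print(ask(protocol, 3))
```

The sequence of answers B receives while identifying $x$ is exactly the code word
$E(x)$, so the number of questions asked equals the code word length.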


2.3  Shannon  entropy  versus  Kolmogorov  complexity 
2.3.1  Shannon  entropy 

It seldom happens that a detailed mathematical theory springs forth in essentially final
form from a single publication. Such was the case with Shannon information theory, which
properly started only with the appearance of C.E. Shannon’s paper ‘The mathematical
theory of communication’ (Shannon, 1948, [102]). In this paper, Shannon proposed a
measure of information in a distribution, which he called the ‘entropy’.

The entropy $H(P)$ of a distribution $P$ measures ‘the inherent uncertainty in $P$’, or
(in fact equivalently), ‘how much information is gained when an outcome of $P$ is
observed’. To make this more precise, let us imagine an observer who knows that $X$ is
distributed according to $P$. The observer then observes $X = x$. The entropy of $P$
stands for the ‘uncertainty of the observer about the outcome $x$ before he observes
it’. Now think of the observer as a ‘receiver’ who receives the message conveying the
value of $X$. From this dual point of view, the entropy stands for

the average amount of information that the observer has gained after
receiving a realized outcome $x$ of the random variable $X$.  (*)

Below, we first give Shannon’s mathematical definition of entropy, and we then connect
it to its intuitive meaning (*).

Definition 2.1

Let $\mathcal{X}$ be a finite or countable set, let $X$ be a random variable taking
values in $\mathcal{X}$ with distribution $P$. Then the (Shannon-) entropy of random
variable $X$ is given by

$$H(X) = -\sum_{x \in \mathcal{X}} p_x \log p_x. \qquad (2.3)$$

Entropy is defined here as a functional mapping random variables to real numbers. In
many texts, entropy is, essentially equivalently, defined as a map from distributions of
random variables to the real numbers. Thus, by definition:
$H(P) := H(X) = -\sum_{x \in \mathcal{X}} p_x \log p_x$.
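
Definition 2.1 translates directly into a few lines of code. The sketch below assumes a
finite distribution given as a list of probabilities and uses the usual convention that
outcomes with probability zero contribute nothing to the sum; the function name is an
illustrative assumption.

```python
from math import log2


def entropy(probabilities) -> float:
    """Shannon entropy in bits: the sum of p_x * log2(1 / p_x) over outcomes with p_x > 0."""
    # p * log2(1 / p) equals -p * log2(p); zero-probability outcomes are skipped.
    return sum(p * log2(1 / p) for p in probabilities if p > 0)


if __name__ == "__main__":
    print(entropy([0.5, 0.5]))                   # a fair coin
    print(entropy([0.5, 0.25, 0.125, 0.125]))    # a skewed four-outcome source
    print(entropy([1 / 32] * 32))                # 32 equally likely letters
```

For a uniform distribution over $N$ outcomes the value is $\log N$, which matches the
number of bits per letter of a fixed-length code when $N$ is a power of two.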


Motivation  Shannon’s  definition  can  be  motivated  in  several  different  ways.  The  two 
most  important  ones  are  the  axiomatic  approach  and  the  coding  interpretation.  In  this 
chapter  we  concentrate  on  the  latter,  but  we  first  briefly  sketch  the  former.  The  idea 
of the axiomatic approach is to postulate a small set of eminently reasonable conditions
that any measure of information relative to a distribution should satisfy. One then
shows that the only measure satisfying all the postulates is the Shannon entropy. We
outline this approach for finite sources $\mathcal{X} = \{1, \ldots, N\}$. We look for a function H
that maps probability distributions on $\mathcal{X}$ to real numbers. For given distribution P,
H(P) should measure ‘how much information is gained on average when an outcome
is made available’. We can write $H(P) = H(p_1, \ldots, p_N)$ where $p_i$ stands for the
probability of i. Suppose we require that

1. $H(p_1, \ldots, p_N)$ is continuous in $p_1, \ldots, p_N$.

2. If all the $p_i$ are equal, $p_i = 1/N$, then H should be a monotonic increasing
function of N. With equally likely events there is more choice, or uncertainty,
when  there  are  more  possible  events. 

3.  If  a  choice  is  broken  down  into  two  successive  choices,  the  original  H  should 
be  the  weighted  sum  of  the  individual  values  of  H.  Rather  than  formalizing 
this condition, we will give a specific example. Suppose that $\mathcal{X} = \{1, 2, 3\}$,
and $p_1 = 1/2$, $p_2 = 1/3$, $p_3 = 1/6$. We can think of $x \in \mathcal{X}$ as being generated in
a two-stage process. First, an outcome in $\mathcal{X}' = \{0, 1\}$ is generated according
to a distribution $P'$ with $p'_0 = p'_1 = 1/2$. If $x' = 1$, we set x = 1 and the
process stops. If $x' = 0$, then outcome ‘2’ is generated with probability 2/3
and outcome ‘3’ with probability 1/3, and the process stops. The final results
have the same probabilities as before. In this particular case we require that

$$H(1/2, 1/3, 1/6) = H(1/2, 1/2) + \tfrac{1}{2} H(2/3, 1/3) + \tfrac{1}{2} H(1).$$

Thus, the entropy of P must be equal to the entropy of the first step in the generation
process, plus the weighted sum (weighted according to the probabilities
in the first step) of the entropies of the second step in the generation process.
As a special case, if $\mathcal{X}$ is the n-fold product space of another space $\mathcal{Y}$,
$X = (Y_1, \ldots, Y_n)$ and the $Y_i$ are all independently distributed according to
$P_Y$, then $H(P_X) = n H(P_Y)$. For example, the total entropy of n independent
tosses of a coin with bias p is $n H(p, 1 - p)$.
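The decomposition requirement can be checked numerically for the Shannon entropy. The sketch below assumes base-2 logarithms and uses hypothetical helper names (`entropy`, `coin_sequence_entropy`); it verifies the two-stage example with probabilities 1/2, 1/3, 1/6 and, by brute force over all outcomes, the additivity for n independent coin tosses.

```python
import math
from itertools import product

def entropy(probs):
    """Shannon entropy in bits of a finite distribution."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Two-stage example: first a fair choice, then (with probability 1/2)
# a second choice with probabilities 2/3 and 1/3.
lhs = entropy([1/2, 1/3, 1/6])
rhs = entropy([1/2, 1/2]) + 0.5 * entropy([2/3, 1/3]) + 0.5 * entropy([1.0])

def coin_sequence_entropy(p, n):
    """Entropy of n independent tosses, enumerating all 2^n outcomes."""
    probs = [math.prod(p if bit else 1.0 - p for bit in bits)
             for bits in product([0, 1], repeat=n)]
    return entropy(probs)

joint = coin_sequence_entropy(0.3, 5)
single = entropy([0.3, 0.7])
print(lhs, rhs)
print(joint, 5 * single)
```

Both pairs of printed values agree up to floating-point rounding.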

Remarkably,  Shannon  (1948)  proved  that 

Theorem  2.2 

The only H satisfying the three above assumptions is of the form $H = -K \sum_{i=1}^{N} p_i \log p_i$,
with K a positive constant.

Thus, requirements (1)-(3) lead us to the definition of entropy (2.3) given above up to
an (unimportant) scaling factor. We shall give a concrete interpretation of this factor
later on. Besides the defining characteristics (1)-(3), the function H has a few other
properties that make it attractive as a measure of information. We mention:

4. $H(p_1, \ldots, p_N)$ is a concave function of the $p_i$.

5. For each N, H achieves its unique maximum for the uniform distribution $p_i = 1/N$.

6. $H(p_1, \ldots, p_N)$ is zero if one of the $p_i$ has value 1. Thus, H is zero if and only
if we do not gain any information at all if we are told that the outcome is i
(since we already knew i would take place with certainty).
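Properties 4 and 5 can be probed numerically. The following is only a spot check on randomly drawn distributions, not a proof; the helper names, the seed and the tolerance are assumptions of this sketch.

```python
import math
import random

def entropy(probs):
    """Shannon entropy in bits of a finite distribution."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def random_distribution(rng, size):
    """Draw positive weights and normalise them to sum to one."""
    weights = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
N = 5
uniform = [1.0 / N] * N
violations = 0
for _ in range(1000):
    p = random_distribution(rng, N)
    q = random_distribution(rng, N)
    lam = rng.random()
    mix = [lam * a + (1.0 - lam) * b for a, b in zip(p, q)]
    # Property 4 (concavity): H(mix) >= lam * H(p) + (1 - lam) * H(q).
    if entropy(mix) < lam * entropy(p) + (1.0 - lam) * entropy(q) - 1e-12:
        violations += 1
    # Property 5: no distribution on N outcomes beats the uniform one.
    if entropy(p) > entropy(uniform) + 1e-12:
        violations += 1
print(violations, entropy(uniform), math.log2(N))
```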

We note that there do exist variations of ‘entropy’ which violate one or more of
requirements (1)-(3); a good example is the family of Rényi entropies [20]. While such
alternative  notions  of  entropy  are  useful  in  their  own,  restricted  context,  Shannon’s 
original  definition  remains  by  far  the  most  important. 

Coding  interpretation  Immediately  after  stating  theorem  2.2,  Shannon  continues  [102], 
‘this  theorem,  and  the  assumptions  required  for  its  proof,  are  in  no  way  necessary  for 
the  present  theory.  It  is  given  chiefly  to  provide  a  certain  plausibility  to  some  of  our 
later  definitions.  The  real  justification  of  these  definitions,  however,  will  reside  in  their 
implications’. 

Thus, in the spirit of Shannon, we will henceforth concentrate on a very concrete interpretation
of entropy in terms of the length (number of bits) needed to encode outcomes
in X. This provides much clearer intuitions; it lies at the root of the many practical
applications of information theory, and, most importantly for us, it simplifies the comparison
to Kolmogorov complexity.

Example  2.1 

We start with an example. The entropy of a random variable X with equally likely
outcomes in a finite sample space $\mathcal{X}$ is given by $H(X) = \log |\mathcal{X}|$. By choosing a
particular message x from $\mathcal{X}$, we remove the entropy from X by the assignment X :=
x and produce or transmit information $I = \log |\mathcal{X}|$ by our selection of x. We show below
that $I = \log |\mathcal{X}|$ (or, to be more precise, the integer $I' = \lceil \log |\mathcal{X}| \rceil$) can be interpreted as
the number of bits that need to be transmitted from an (imagined) sender to an (imagined)
receiver. ◊

We now connect entropy to minimum average code lengths. These are defined as follows:

Definition  2.2 

Let source words $x \in \{0, 1\}^*$ be produced by a random variable X with probability
$P(x) = p_x$ for the event X = x. The characteristics of X are fixed. Now consider prefix
codes $D : \{0, 1\}^* \to \mathcal{N}$ with one code word per source word and denote the length
of the code word for x by $l_x$. We want to minimize the expected number of bits we
have to transmit for the given source X and choose a prefix code D that achieves this.
In order to do so, we must minimize the average code-word length $L_D = \sum_x p_x l_x$. We
define the minimal average code word length as $L = \min\{L_D : D \text{ is a prefix-code}\}$.
A prefix-code D such that $L_D = L$ is called an optimal prefix-code with respect to
prior probability P of the source words.

The  (minimal)  average  code  length  of  an  (optimal)  code  does  not  depend  on  the  details 
of  the  set  of  code  words,  but  only  on  the  set  of  code-word  lengths.  It  is  just  the  expected 
code-word  length  with  respect  to  the  given  distribution.  Shannon  discovered  that  the 
minimal  average  code  word  length  is  about  equal  to  the  entropy  of  the  source  word  set. 
This  is  known  as  the  noiseless  coding  theorem.  The  adjective  ‘noiseless’  emphasizes 
that  we  ignore  the  possibility  of  errors. 

Theorem 2.3

Let L and P be as above. If $H(P) = -\sum_x p_x \log p_x$ is the entropy, then

$$H(P) \le L \le H(P) + 1. \qquad (2.4)$$


We are typically interested in encoding a binary string of length n with entropy proportional
to n (example 2.3). The essence of (2.4) is that, for all but the smallest n, the
difference between entropy and minimal expected code length is completely negligible.
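The bounds in (2.4) can be illustrated on a small source. The sketch below uses Huffman's algorithm, which the text does not discuss but which is known to produce an optimal prefix code, to obtain the minimal average code-word length L for an assumed four-symbol distribution. All names and the example probabilities are hypothetical.

```python
import heapq
import math

def huffman_lengths(probs):
    """Code-word lengths of a Huffman code for the given probabilities."""
    # Heap entries: (subtree probability, unique tie-breaker, symbols in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, symbols1 = heapq.heappop(heap)
        p2, _, symbols2 = heapq.heappop(heap)
        # Merging two subtrees puts every symbol in them one level deeper.
        for s in symbols1 + symbols2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, next_id, symbols1 + symbols2))
        next_id += 1
    return lengths

probs = [0.4, 0.3, 0.2, 0.1]
lengths = huffman_lengths(probs)
L = sum(p * l for p, l in zip(probs, lengths))           # average length
H = sum(p * math.log2(1.0 / p) for p in probs)           # entropy in bits
kraft_sum = sum(2.0 ** -l for l in lengths)              # 1 for a complete code
print(lengths, L, H, kraft_sum)
```

For this source the average length exceeds the entropy by only a small fraction of a bit, well inside the one-bit margin allowed by (2.4).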

It turns out that a code meeting the bounds in (2.4) is relatively easy to construct:
the Shannon-Fano code. Let there be N symbols (also called basic messages or source words).
Order these symbols according to decreasing probability, say $\mathcal{X} = \{1, 2, \ldots, N\}$ with
probabilities $p_1, p_2, \ldots, p_N$. Let $P_r = \sum_{i=1}^{r-1} p_i$, for $r = 1, \ldots, N$. The binary code
$E : \mathcal{X} \to \{0, 1\}^*$ is obtained by coding r as a binary number E(r), obtained by
truncating the binary expansion of $P_r$ at length l(E(r)) such that

$$-\log p_r \le l(E(r)) < 1 - \log p_r.$$

This  code  is  the  Shannon-Fano  code.  It  has  the  property  that  highly  probable  symbols 
are  mapped  to  short  code  words  and  symbols  with  low  probability  are  mapped  to  longer 
code words (as is done, in a less optimal setting, in Morse code). Moreover,

$$2^{-l(E(r))} \le p_r < 2^{-l(E(r))+1}.$$

Note that the code for symbol r differs from all codes of symbols r + 1 through N in
one or more bit positions, since for all i with $r + 1 \le i \le N$,

$$P_i \ge P_r + p_r \ge P_r + 2^{-l(E(r))}.$$

Therefore the binary expansions of $P_r$ and $P_i$ differ in the first l(E(r)) positions. This
means that E is one-to-one, and it has an inverse: the decoding mapping $E^{-1}$. Even
better, since no value of E is a prefix of any other value of E, the set of code words is
a prefix-code. This means we can recover the source message from the code message
by scanning it from left to right without look-ahead. If $H_1$ is the average number of
bits used per symbol of an original message, then $H_1 = \sum_r p_r l(E(r))$. Combining
this with the previous inequality we obtain (2.4):

$$-\sum_r p_r \log p_r \le H_1 < \sum_r (1 - \log p_r) p_r = 1 - \sum_r p_r \log p_r.$$
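The construction above can be sketched in a few lines. The code below follows the description in the text (cumulative probabilities, truncated binary expansions) and uses exact rational arithmetic to avoid rounding issues; the function names and the example distribution are assumptions made for illustration.

```python
from fractions import Fraction
from math import log2

def shannon_fano(probs):
    """Code words for probabilities listed in decreasing order (as Fractions)."""
    codes, cumulative = [], Fraction(0)
    for p in probs:
        # Smallest length l with 2^-l <= p, so that -log p <= l < 1 - log p.
        length = 0
        while Fraction(1, 2 ** length) > p:
            length += 1
        # First `length` bits of the binary expansion of the cumulative sum P_r.
        bits, frac = [], cumulative
        for _ in range(length):
            frac *= 2
            bit = 1 if frac >= 1 else 0
            bits.append(str(bit))
            frac -= bit
        codes.append("".join(bits))
        cumulative += p
    return codes

def is_prefix_free(codes):
    """True if no code word is a prefix of a different code word."""
    return not any(i != j and codes[j].startswith(codes[i])
                   for i in range(len(codes)) for j in range(len(codes)))

probs = [Fraction(2, 5), Fraction(3, 10), Fraction(1, 5), Fraction(1, 10)]
codes = shannon_fano(probs)
average_length = float(sum(p * len(c) for p, c in zip(probs, codes)))
H = -sum(float(p) * log2(float(p)) for p in probs)
print(codes, average_length, H, is_prefix_free(codes))
```

For this example the average length lies between the entropy and the entropy plus one, although it is longer than the average length of an optimal code for the same source.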

Interpretation  in  terms  of  sequential  questions  We  re-interpret  Shannon’s  noiseless 
coding  theorem  in  terms  of  protocols  for  sequentially  asking  questions:  suppose  that 
B asks questions of the type ‘Is x in the set $\mathcal{X}'$?’, where $\mathcal{X}'$ is some subset of $\mathcal{X}$. A
answers truthfully to each question, and B keeps asking questions until he has determined
the exact value of the realized outcome x of the random variable X. In section
2.2  we  showed  that  each  protocol  that  B  can  use  may  be  thought  of  as  a  prefix  code 
with  one  code  word  per  source  word,  and  vice  versa.  Therefore,  theorem  2.3  may  be 
interpreted  as  follows.  Suppose  it  is  B’s  goal  to  determine  the  exact  value  of  X  using 
as few questions as possible. If B asks his questions in the cleverest possible way, he
will on average need to ask between H(X) and H(X) + 1 questions to find out the exact
value of X. From this point of view, the Shannon-Fano code we described above is a
protocol  for  asking  questions  that  is  ‘almost’  optimal,  where  the  ‘optimal’  protocol  is 
the  protocol  that  minimizes  the  expected  number  of  questions  to  be  asked. 

Problem and lacuna Shannon observes, ‘Messages have meaning [... however ...]
the semantic aspects of communication are irrelevant to the engineering problem’.
Thus, in Shannon’s theory ‘information’ is fully determined by the probability distribution
on the set of possible messages, and unrelated to the meaning, structure or
content of individual messages. This is problematic in at least two ways:

First,  in  many  practical  cases,  the  distribution  generating  outcomes  may  be  unknown  to 
the  observer  or  (worse),  may  not  exist  at  all*.  For  example,  can  we  answer  a  question 
like  ‘what  is  the  information  in  this  book’  by  viewing  it  as  an  element  of  a  set  of 
possible  books  with  a  probability  distribution  on  it?  This  seems  unlikely.  And  how  to 
measure  the  quantity  of  hereditary  information  in  biological  organisms,  as  encoded  in 
DNA?  Again  there  is  the  possibility  of  seeing  a  particular  form  of  animal  as  one  of  a 
set  of  possible  forms  with  a  probability  distribution  on  it.  This  seems  to  be  contradicted 
by  the  fact  that  the  calculation  of  all  possible  lifeforms  in  existence  at  any  one  time  on 
earth would give a ridiculously low figure like $2^{100}$.

*  Even  if  we  adopt  a  Bayesian  (subjective)  interpretation  of  probability,  this  problem  remains 
[47]. 
Shannon’s classical information theory assigns a quantity of information to an ensemble
of possible messages. All messages in the ensemble being equally probable, this
quantity is the number of bits needed to count all possibilities. This expresses the fact
that each message in the ensemble can be communicated using this number of bits.
However, it does not say anything about the number of bits needed to convey any
individual message in the ensemble, and this constitutes a second ‘lacuna’ of Shannon’s
theory. To illustrate this, consider the ensemble consisting of all binary strings of length
9999999999999999. By Shannon’s measure, we require 9999999999999999 bits on
the average to encode a string in such an ensemble. However, the string consisting of
9999999999999999 1’s can be encoded in about 55 bits by expressing 9999999999999999
in binary and adding the repeated pattern ‘1’. A requirement for this to work is that
we have agreed on an algorithm that decodes the encoded string. We can compress the
string still further when we note that 9999999999999999 equals $3^2 \times 1111111111111111$,
and that 1111111111111111 consists of $2^4$ 1’s.
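The arithmetic in this illustration is easy to check. The sketch below computes the number of bits in the binary expansion of the repeat count and confirms the factorization; the `decode` generator is a hypothetical stand-in for the agreed decoding algorithm and yields the string lazily rather than storing it.

```python
n = 9999999999999999

# Number of bits needed to write the repeat count n in binary.
count_bits = n.bit_length()

# The factorization from the text: n = 3^2 * 1111111111111111,
# where the second factor is written with 2^4 = 16 ones.
factor = int("1" * 2 ** 4)
factorization_holds = (n == 3 ** 2 * factor)

def decode(count, pattern="1"):
    """Lazily reproduce the described string, one symbol at a time."""
    for _ in range(count):
        yield pattern

print(count_bits, factorization_holds, "".join(decode(5)))
```

The count itself takes 54 bits; the remaining bit or so of the ‘about 55 bits’ is spent on naming the repeated pattern.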

Thus,  we  have  discovered  an  interesting  phenomenon:  the  description  of  some  strings 
can  be  compressed  considerably,  provided  they  exhibit  enough  regularity.  However, 
if  regularity  is  lacking,  it  becomes  more  cumbersome  to  express  large  numbers.  For 
instance,  it  seems  easier  to  compress  the  number  ‘one  billion’,  than  the  number  ‘one 
billion  seven  hundred  thirty-five  million  two  hundred  sixty-eight  thousand  and  three 
hundred  ninety-four’,  even  though  they  are  of  the  same  order  of  magnitude. 

We are interested in a measure of information that, unlike Shannon’s, does not rely
on (often untenable) probabilistic assumptions, and that takes into account the phenomenon
that ‘regular’ strings are compressible. Thus, we aim for a measure of the information
content of an individual finite object, and of the information conveyed about
an individual finite object by another individual finite object. Here, we want the
information content of an object x to be an attribute of x alone, and not to depend on,
for instance, the means chosen to describe this information content. Surprisingly, this
turns  out  to  be  possible,  at  least  to  a  large  extent.  The  resulting  theory  of  information 
is  based  on  Kolmogorov  complexity,  a  notion  independently  proposed  by  Solomonoff 
[106],  Kolmogorov  [65]  and  Chaitin  [16];  Li  and  Vitanyi  [71]  describe  the  history  of 
the  subject. 

2.3.2  Kolmogorov  complexity 

Suppose  we  want  to  describe  a  given  object  by  a  finite  binary  string.  We  do  not  care 
whether  the  object  has  many  descriptions;  however,  each  description  should  describe 
only  one  object.  From  among  all  descriptions  of  an  object  we  can  take  the  length  of 
the  shortest  description  as  a  measure  of  the  object’s  complexity.  It  is  natural  to  call  an 
object  ‘simple’  if  it  has  at  least  one  short  description,  and  to  call  it  ‘complex’  if  all  of 
its  descriptions  are  long. 

As in section 2.2, consider a description method D, to be used to transmit messages
from a sender to a receiver. If D is known to both a sender and receiver, then a message
x can be transmitted from sender to receiver by transmitting the description y with
D(y) = x. The cost of this transmission is measured by l(y), the length of y. The least
cost of transmission of x is determined by the length function L(x): recall that L(x) is
the length of the shortest y such that D(y) = x. We choose this length function as the
descriptional complexity of x under specification method D.
Obviously, this descriptional complexity of x depends crucially on D. The general
principle involved is that the syntactic framework of the description language determines
the succinctness of description.

In  order  to  objectively  compare  descriptional  complexities  of  objects,  to  be  able  to  say 
‘x is more complex than z’, the descriptional complexity of x should depend on x
alone.  This  complexity  can  be  viewed  as  related  to  a  universal  description  method  that 
is  a  priori  assumed  by  all  senders  and  receivers.  This  complexity  is  optimal  if  no  other 
description  method  assigns  a  lower  complexity  to  any  object. 

We  are  not  really  interested  in  optimality  with  respect  to  all  description  methods.  For 
specifications  to  be  useful  at  all  it  is  necessary  that  the  mapping  from  y  to  D(y)  can 
be  executed  in  an  effective  manner.  That  is,  it  can  at  least  in  principle  be  performed  by 
humans or machines. This notion has been formalized as that of ‘partial recursive functions’,
also known simply as computable functions (by Turing machines). According
to  generally  accepted  mathematical  viewpoints  it  coincides  with  the  intuitive  notion  of 
effective  computation. 

The set of partial recursive functions contains an optimal function that minimizes the
description length of every other such function. We denote this function by $D_0$. Namely,
for any other partial recursive function D, for all objects x, there is a description y of x under
$D_0$ that is shorter than any description z of x under D. (That is, shorter up to an additive
constant that is independent of x.) Complexity with respect to $D_0$ minimizes the
complexities with respect to all partial recursive functions.

We identify the length of the description of x with respect to a fixed specification function
$D_0$ with the ‘algorithmic (descriptional) complexity’ of x. The optimality of $D_0$ in
the sense above means that the complexity of an object x is invariant (up to an additive
constant independent of x) under transition from one optimal specification function
to another. Its complexity is an objective attribute of the described object alone: it is
an intrinsic property of that object, and it does not depend on the description formalism.
This complexity can be viewed as ‘absolute information content’: the amount of
information  that  needs  to  be  transmitted  between  all  senders  and  receivers  when  they 
communicate  the  message  in  absence  of  any  other  a  priori  knowledge  that  restricts  the 
domain  of  the  message.  Thus,  we  have  outlined  the  program  for  a  general  theory  of 
algorithmic  complexity.  The  three  major  innovations  are  as  follows: 

1. In restricting ourselves to formally effective descriptions, our definition covers
every form of description that is intuitively acceptable as being effective
according to general viewpoints in mathematics and logic.

2. The restriction to effective descriptions entails that there is a universal description
method that minimizes the description length or complexity with respect
to any other effective description method. Significantly, this implies item 3.

3.  The  description  length  or  complexity  of  an  object  is  an  intrinsic  attribute  of 
the  object  independent  of  the  particular  description  method  or  formalizations 
thereof. 

2.3.2.1 Formal details

The  Kolmogorov  complexity  K(x)  of  a  finite  object  x  will  be  defined  as  the  length  of 
the  shortest  effective  binary  description  of  x.  Broadly  speaking,  K(x)  may  be  thought 
of as the length of the shortest computer program that prints x and then halts. This
computer  program  may  be  written  in  C,  Java,  LISP  or  any  other  universal  language: 
we  shall  see  that,  for  any  two  universal  languages,  the  resulting  program  lengths  differ 
at  most  by  a  constant  not  depending  on  x. 

To make this precise, let $T_1, T_2, \ldots$ be a standard enumeration of all Turing machines,
and let $\phi_1, \phi_2, \ldots$ be the enumeration of corresponding functions which are computed
by the respective Turing machines. That is, $T_i$ computes $\phi_i$. These functions are the
partial recursive functions or computable functions. For technical reasons we are interested
in the so-called prefix complexity, which is associated with Turing machines for
which the set of programs (inputs) resulting in a halting computation is prefix free†.
We can realize this by equipping the Turing machine with a one-way input tape, a separate
work tape, and a one-way output tape. Such Turing machines are called prefix
machines since the halting programs for any one of them form a prefix free set.

We first define $K_{T_i}(x)$, the prefix Kolmogorov complexity of x relative to a given
prefix machine $T_i$, where $T_i$ is the i-th prefix machine in a standard enumeration of
them. $K_{T_i}(x)$ is defined as the length of the shortest input sequence y such that
$T_i(y) = \phi_i(y) = x$. If no such input sequence exists, $K_{T_i}(x)$ remains undefined. Of course, this
preliminary definition is still highly sensitive to the particular prefix machine $T_i$ that
we use. But now the ‘universal prefix machine’ comes to our rescue. Just as there exist
universal ordinary Turing machines, there also exist universal prefix machines. These
have the remarkable property that they can simulate every other prefix machine. More
specifically, there exists a prefix machine U such that, with as input the pair $\langle i, y \rangle$, it
outputs $\phi_i(y)$ and then halts. We now fix, once and for all, a prefix machine U with
this property and call U the reference machine. The Kolmogorov complexity K(x) of
x is defined as $K_U(x)$.

Let us formalize this definition. Let $\langle \cdot \rangle$ be a standard invertible effective one-one encoding
from $\mathcal{N} \times \mathcal{N}$ to a prefix-free subset of $\mathcal{N}$. $\langle \cdot \rangle$ may be thought of as the encoding
function of a prefix code. For example, we can set $\langle x, y \rangle = x'y'$, where $x'$ denotes
the prefix-free version of x.

We insist on prefix-freeness and recursiveness (i.e. partial recursive functions) because
we want a universal Turing machine to be able to read an image under $\langle \cdot \rangle$ from left to
right and determine where it ends.
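To see why prefix-freeness lets a machine find the end of an encoded pair while reading from left to right, consider the following sketch. It assumes one particularly simple self-delimiting encoding of a binary string x, namely l(x) ones, a zero, and then x itself; this is chosen for brevity and is not necessarily the standard prefix code used elsewhere in the text. All function names are hypothetical.

```python
def self_delimiting(x: str) -> str:
    """Encode a binary string as 1^len(x) 0 x (length 2 * len(x) + 1)."""
    return "1" * len(x) + "0" + x

def pair(x: str, y: str) -> str:
    """Concatenate the self-delimiting versions of x and y."""
    return self_delimiting(x) + self_delimiting(y)

def read_one(stream: str, pos: int):
    """Read one self-delimiting string starting at pos, left to right."""
    length = 0
    while stream[pos] == "1":   # the unary prefix announces the length
        length += 1
        pos += 1
    pos += 1                    # skip the terminating zero
    return stream[pos:pos + length], pos + length

def unpair(z: str):
    """Invert pair(); no look-ahead beyond the current symbol is needed."""
    x, pos = read_one(z, 0)
    y, pos = read_one(z, pos)
    if pos != len(z):
        raise ValueError("trailing symbols after the encoded pair")
    return x, y

encoded = pair("101", "0011")
print(encoded, unpair(encoded))
```

The decoder never has to be told where the first component stops: the unary length prefix makes each component announce its own end.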

Definition  2.3 

Let U be our reference prefix machine, i.e. for all $i \in \mathcal{N}$, $y \in \{0, 1\}^*$, $U(\langle i, y \rangle) = \phi_i(y)$.
The prefix Kolmogorov complexity of x is

$$K(x) = \min_z \{ l(z) : U(z) = x,\ z \in \{0, 1\}^* \}
      = \min_{i,y} \{ l(\langle i, y \rangle) : \phi_i(y) = x,\ y \in \{0, 1\}^*,\ i \in \mathcal{N} \}. \qquad (2.5)$$

We can alternatively think of z as a program that prints x and then halts, or as $z = \langle i, y \rangle$
where y is a program such that, when $T_i$ is input program y, it prints x and then halts.

Thus, by definition $K(x) = l(x^*)$, where $x^*$ is the lexicographically first shortest
self-delimiting (prefix) program for x with respect to the reference prefix machine.

† There exists a version of Kolmogorov complexity corresponding to programs that are not
necessarily  prefix-free,  but  we  will  not  go  into  it  here. 
Consider the mapping E* defined by E*(x) = x*. This may be viewed as the encoding
function of a prefix-code (decoding function) D* with D*(x*) = x. By its
definition,  D*  is  a  very  parsimonious  code.  The  reason  for  working  with  prefix  rather 
than  standard  Turing  machines  is  that,  for  many  of  the  subsequent  developments,  we 
need  D*  to  be  prefix. 

Though defined in terms of a particular machine model, the Kolmogorov complexity
is machine-independent up to an additive constant and acquires an asymptotically
universal and absolute character through Church’s thesis ([71], p. 29), from the ability
of universal machines to simulate one another and execute any effective process.
The  Kolmogorov  complexity  of  an  object  can  be  viewed  as  an  absolute  and  objective 
quantification  of  the  amount  of  information  in  it. 

Example  2.2 

To develop some intuitions, it is useful to think of K(x) as the length of the shortest program for
x in some standard programming language such as LISP or Java. Consider the lexicographical
enumeration of all syntactically correct LISP programs $\lambda_1, \lambda_2, \ldots$, and
the lexicographical enumeration of all syntactically correct Java programs $\pi_1, \pi_2, \ldots$.
We assume that the programs in both enumerations are encoded in some standard prefix-free manner.
With proper definitions we can view the programs in both enumerations as computing
partial recursive functions from their inputs to their outputs. Choosing reference
machines in both enumerations we can define complexities $K_{\text{LISP}}(x)$ and $K_{\text{Java}}(x)$
completely analogous to K(x). All of these measures of the descriptional complexities
of x coincide up to a fixed additive constant. Let us show this directly for $K_{\text{LISP}}(x)$
and $K_{\text{Java}}(x)$. Since LISP is universal, there exists a LISP program $\lambda_P$ implementing
a Java-to-LISP compiler. $\lambda_P$ translates each Java program to an equivalent LISP
program. Consequently, for all x, $K_{\text{LISP}}(x) \le K_{\text{Java}}(x) + 2 l(P)$. Similarly, there
is a Java program $\pi_L$ that is a LISP-to-Java compiler, so that for all x,
$K_{\text{Java}}(x) \le K_{\text{LISP}}(x) + 2 l(L)$. It follows that $|K_{\text{Java}}(x) - K_{\text{LISP}}(x)| \le 2 l(P) + 2 l(L)$ for all x.

The programming language view immediately tells us that K(x) must be small for
‘simple’ or ‘regular’ objects x. For example, there exists a fixed-size program that,
when input n, outputs the first n bits of $\pi$ and then halts. Specification of n takes at
most $L_{\mathcal{N}}(n) = \log n + 2 \log \log n + 1$ bits. Thus, if x consists of the first n binary digits
of $\pi$, then $K(x) \le \log n + 2 \log \log n + O(1)$. Similarly, if $0^n$ denotes the string consisting of
n 0’s, then $K(0^n) \le \log n + 2 \log \log n + O(1)$.

On the other hand, for all x, there exists a program ‘print x; halt’. This shows that
for all x, $K(x) \le l(x) + O(\log l(x))$, where the logarithmic term pays for making the
program self-delimiting. As was previously noted, for any prefix code, there are no more
than $2^m$ strings x which can be described by m or fewer bits. In particular, this holds for
the prefix code $D^*$ whose length function is K(x). Thus, the fraction of strings x of
length n with $K(x) \le m$ is at most $2^{m-n}$: the overwhelming majority of sequences
cannot be compressed by more than a constant. Specifically, if x is determined by n
independent tosses of a fair coin, then with overwhelming probability, $K(x) \ge l(x)$.
Thus, while for very regular strings, the Kolmogorov complexity is small (sublinear in
the length of the string), most strings are ‘random’ and have Kolmogorov complexity
about equal to their own length. ◊
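The contrast between regular and ‘random’ strings can be made tangible with an off-the-shelf compressor. In the sketch below, the compressed size produced by Python's `zlib` module serves only as a crude, computable upper bound on description length relative to one fixed decompressor; it is not the Kolmogorov complexity. The function name, the string length and the seed are assumptions.

```python
import random
import zlib

def compressed_bits(data: bytes) -> int:
    """Size in bits of the zlib-compressed form of data (maximum effort)."""
    return 8 * len(zlib.compress(data, 9))

n = 10_000
regular = b"\x00" * n                                   # highly regular string
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(n))     # pseudo-random bytes

regular_bits = compressed_bits(regular)
noise_bits = compressed_bits(noise)
print(8 * n, regular_bits, noise_bits)
```

The regular string shrinks to a tiny fraction of its length, while the noisy one does not shrink at all. Note that the noisy string comes from a short deterministic program with a fixed seed, so its true Kolmogorov complexity is small as well; the compressor simply fails to discover that regularity.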
Problem and lacuna Unfortunately K(x) is not a recursive function: the Kolmogorov
complexity  is  not  computable  in  general.  This  means  that  there  exists  no  computer 
program  that,  when  input  an  arbitrary  string,  outputs  the  Kolmogorov  complexity  of 
that string and then halts. This is closely related to Gödel’s incompleteness theorem [71].
While  there  exist  ‘feasible’,  resource-bounded  forms  of  Kolmogorov  complexity  [71], 
these  lack  some  of  the  elegant  properties  of  the  original,  uncomputable  notion. 

Now  suppose  we  are  interested  in  efficient  storage  and  transmission  of  long  sequences 
of  data.  According  to  Kolmogorov,  we  can  compress  such  sequences  in  an  essentially 
optimal way by storing or transmitting the shortest program that generates them. Unfortunately,
as we have just seen, we cannot find such a program in general. According
to Shannon, we can compress such sequences optimally in an average sense (and therefore,
it turns out, also with high probability) if they are distributed according to some
P and we know P. Unfortunately, in practice, P is often unknown or even nonexistent.
Thus, neither Shannon’s nor Kolmogorov’s idea is directly applicable to most
actual data compression problems. For these, we can use universal codes which may
be viewed at the same time as an extension of Shannon’s, and a ‘downscaling’ of Kolmogorov’s
theory.


2.4  Universal  coding:  interpolating  between  Kolmogorov  and  Shannon 

Below we repeatedly use the coding concepts introduced in section 2.2. Suppose we
are given a recursive enumeration of prefix codes $D_1, D_2, \ldots$. Let $L_1, L_2, \ldots$ be the
length functions associated with these codes. That is, $L_i(x) = \min\{l(y) : D_i(y) = x\}$;
if there exists no y with $D_i(y) = x$, then $L_i(x) = \infty$. We may encode x by first
encoding a natural number k using the standard prefix code for the natural numbers.
We then encode x itself using the code $D_k$. This leads to a so-called two-part code $\bar{D}$
with lengths $\bar{L}$. By construction, this code is prefix and its lengths satisfy

$$\bar{L}(x) := \min_{k \in \mathcal{N}} L_{\mathcal{N}}(k) + L_k(x). \qquad (2.6)$$

Let x be an infinite binary sequence and let $x_{[1:n]} \in \{0, 1\}^n$ be the initial n-bit segment
of this sequence. Since $L_{\mathcal{N}}(k) = O(\log k)$, we have for all k, all n:

$$\bar{L}(x_{[1:n]}) \le L_k(x_{[1:n]}) + O(\log k).$$


Recall that for each fixed $L_k$, the fraction of sequences of length n that can be compressed
by more than m bits is less than $2^{-m}$. Thus, typically, the codes $L_k$ and the
strings $x_{[1:n]}$ will be such that $L_k(x_{[1:n]})$ grows linearly with n. This implies that for
every x, the newly constructed $\bar{L}$ is ‘almost as good’ as whatever code $D_k$ in the list
is best for that particular x: the difference in code lengths is bounded by a constant
depending on k but not on n. In particular, for each k and each infinite sequence x,

$$\lim_{n \to \infty} \frac{\bar{L}(x_{[1:n]})}{L_k(x_{[1:n]})} \le 1. \qquad (2.7)$$


A code satisfying (2.7) is called a universal code relative to the comparison class of
codes $\{D_1, D_2, \ldots\}$. It is ‘universal’ in the sense that it compresses every sequence
essentially as well as the $D_k$ that compresses that particular sequence the most. In
general,  there  exist  many  types  of  codes  that  are  universal:  the  2-part  universal  code 
defined  above  is  just  one  means  of  achieving  (2.7). 

Universal codes and Kolmogorov  In most practically interesting cases we may
assume that for all $k$, the decoding function $D_k$ is computable, i.e. there exists a prefix
Turing machine which for all $y \in \{0,1\}^*$, when input $y'$ (the prefix-free version of $y$),
outputs $D_k(y)$ and then halts. Since such a program has finite length, we must have for
all $k$,

$$l(E^*(x_{[1:n]})) = K(x_{[1:n]}) \stackrel{+}{<} L_k(x_{[1:n]}),$$

where $E^*$ is the encoding function defined earlier, with $l(E^*(x)) = K(x)$. Comparing
with (2.7) shows that the code $D^*$ with encoding function $E^*$ is a universal code
relative to $D_1, D_2, \ldots$. Thus, we see that the Kolmogorov complexity $K$ is just the length
function of the universal code $D^*$. Note that $D^*$ is an example of a universal code that
is not (explicitly) two-part.


Example  2.3 

Let us create a universal two-part code that allows us to significantly compress all
binary strings with frequency of 0’s deviating significantly from $\frac{1}{2}$. For $n_0 \le n$, let
$D_{(n,n_0)}$ be the code that assigns code words of equal (minimum) length to all strings
of length $n$ with $n_0$ zeroes, and no code words to any other strings. Then $D_{(n,n_0)}$ is a
prefix-code and $L_{(n,n_0)}(x) = \lceil \log \binom{n}{n_0} \rceil$. The universal two-part code $D$ relative to the
set of codes $\{D_{(i,j)} : i, j \in \mathcal{N}\}$ then achieves the following lengths (to within 1 bit):
for all $n$, all $n_0 \in \{0, \ldots, n\}$, all $x_{[1:n]}$ with $n_0$ zeroes,

$$L(x_{[1:n]}) = \log n + \log n_0 + 2 \log\log n + 2 \log\log n_0 + \log \binom{n}{n_0}
= \log \binom{n}{n_0} + O(\log n). \tag{2.8}$$


Using Stirling’s approximation of the factorial, $n! \sim n^n e^{-n} \sqrt{2\pi n}$, we find that

$$\begin{aligned}
\log \binom{n}{n_0} &= \log n! - \log n_0! - \log (n - n_0)! \\
&= n \log n - n_0 \log n_0 - (n - n_0) \log (n - n_0) + O(\log n) \\
&= n H(n_0/n) + O(\log n).
\end{aligned} \tag{2.9}$$


Note that $H(n_0/n) \le 1$, with equality if and only if $n_0 = n/2$. Therefore, if the
frequency deviates significantly from $\frac{1}{2}$, $D$ compresses $x_{[1:n]}$ by a factor linear in $n$.
In all such cases, $D^*$ compresses the data by at least the same linear factor. Note that
(a) each individual code $D_{(n,n_0)}$ is capable of exploiting a particular type of regularity in
a sequence to compress that sequence, (b) the universal code $D$ may exploit many different
types of regularities to compress a sequence, and (c) the code $D^*$ with lengths given by the
Kolmogorov complexity asymptotically exploits all computable regularities so as to
maximally compress a sequence. ◊
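
The lengths in this example can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the helper `nat_code_length` is a hypothetical idealised stand-in for the standard prefix code for the natural numbers (it charges log m + 2 log log m bits for m = k + 1, mirroring the terms in (2.8)), and all function names are ours, not part of the report.

```python
import math


def nat_code_length(k: int) -> float:
    """Idealised length in bits of a prefix code for the natural number k >= 0.

    Assumption: we charge log m + 2 log log m bits for m = k + 1, mirroring the
    terms in (2.8); the shift by one only avoids log 0.
    """
    m = k + 1
    log_m = math.log2(m)
    return log_m + 2 * math.log2(max(log_m, 1.0))


def binary_entropy(q: float) -> float:
    """H(q) = -q log q - (1 - q) log(1 - q), in bits."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)


def two_part_length(x: str) -> float:
    """Two-part code length of the bit string x: describe n and n0, then the
    index of x among all strings of length n with n0 zeroes."""
    n, n0 = len(x), x.count("0")
    index_bits = math.ceil(math.log2(math.comb(n, n0)))
    return nat_code_length(n) + nat_code_length(n0) + index_bits


skewed = "0" * 900 + "1" * 100      # frequency of zeroes far from 1/2
balanced = "01" * 500               # frequency of zeroes exactly 1/2
for name, x in (("skewed", skewed), ("balanced", balanced)):
    n, n0 = len(x), x.count("0")
    print(name, round(two_part_length(x), 1), round(n * binary_entropy(n0 / n), 1))
```

The skewed string is coded in far fewer than n bits, close to n H(n_0/n). The balanced string is not compressed at all by this code family, although it is obviously regular: each code in the list exploits only one type of regularity, here the frequency of zeroes.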


Universal codes and Shannon  If $X$ is distributed according to some distribution $P$,
then the optimal (in the average sense) code to use is the Shannon-Fano code. But now
suppose it is only known that $P \in \mathcal{P}$, where $\mathcal{P}$ is some given (possibly very large, e.g.
uncountable) set of candidate distributions. Now it is not clear what code is optimal.
We may try the Shannon-Fano code for a particular $P \in \mathcal{P}$, but such a code will
typically lead to very large expected code lengths if $X$ turns out to be distributed according
to some $Q \in \mathcal{P}$, $Q \ne P$. We may ask whether there exists another code that is ‘almost’
as good as the Shannon-Fano code for $P$, no matter what $P \in \mathcal{P}$ actually generates
the sequence. We now show that, provided $\mathcal{P}$ is finite or countable, then (perhaps
surprisingly) the answer is yes. To see this, we need the notion of an information source.
An information source may be thought of as a probability distribution over arbitrarily
long sequences, of which an observer gets to see longer and longer initial segments;
examples are given below. Formally, an information source $P$ is a probability
distribution on the set $\{0,1\}^\infty$ of one-way infinite sequences. Such a $P$ can be identified with
the distributions $P^{(1)}$ on $\{0,1\}^1$, $P^{(2)}$ on $\{0,1\}^2$, $\ldots$. Here $P^{(n)}$ denotes the marginal
distribution of $P$ on the first $n$-bit segments. $P^{(n)}$ is related to $P^{(n+1)}$ as follows: for
all $n \ge 0$, all $x \in \{0,1\}^n$, $\sum_{y \in \{0,1\}} P^{(n+1)}(xy) = P^{(n)}(x)$ and $P^{(0)}(x) = 1$.

Suppose then that $\mathcal{P}$ is a finite or countable set of information sources. Then the
members of $\mathcal{P}$ may be listed as $P_1, P_2, \ldots$. To each marginal distribution $P_k^{(n)}$, there
corresponds a unique Shannon-Fano code defined on the set $\{0,1\}^n$ with lengths

$$L_{(n,k)}(x) := \lceil -\log P_k^{(n)}(x) \rceil.$$

For given $P \in \mathcal{P}$, we define

$$H(P^{(n)}) := -\sum_{x \in \{0,1\}^n} P^{(n)}(x) \log P^{(n)}(x)$$

as the entropy of the distribution of the first $n$ outcomes.


Let $E$ be a prefix-code assigning code word $E(x)$ to source word $x \in \{0,1\}^n$.
The noiseless coding theorem 2.3 asserts that the minimal average code word length
$L(P) = \sum_{x \in \{0,1\}^n} P^{(n)}(x)\, l(E(x))$ among all such prefix-codes $E$ satisfies

$$H(P^{(n)}) \le L(P) \le H(P^{(n)}) + 1.$$


The entropy $H(P^{(n)})$ can therefore be interpreted as the expected code length of
encoding the first $n$ bits generated by the source $P$, when the optimal (Shannon-Fano)
code is used. We look for a prefix code $D$ with length function $L$ that satisfies, for all
$P \in \mathcal{P}$:

$$\lim_{n \to \infty} \frac{E_P L(X_{[1:n]})}{H(P^{(n)})} \le 1,$$

where $E_P L(X_{[1:n]}) = \sum_{x \in \{0,1\}^n} P^{(n)}(x) L(x)$. Define $D$ as the following two-part
code: first, $n$ is encoded using the standard prefix code for natural numbers. Then,
among all codes $D_{(n,k)}$, the $k$ that minimizes $L_{(n,k)}(x)$ is encoded (again using the
standard prefix code); finally, $x$ is encoded in $L_{(n,k)}(x)$ bits. Then for all $n$, for all $k$,
for every sequence $x_{[1:n]}$,

$$L(x_{[1:n]}) \le L_{(n,k)}(x_{[1:n]}) + L_{\mathcal{N}}(k) + L_{\mathcal{N}}(n). \tag{2.10}$$


Since (2.10) holds for all strings of length $n$, it must also hold in expectation for all
possible distributions on strings of length $n$. In particular, this gives, for all $k \in \mathcal{N}$,

$$E_{P_k} L(X_{[1:n]}) \le E_{P_k} L_{(n,k)}(X_{[1:n]}) + O(\log n) = H(P_k^{(n)}) + O(\log n),$$

from which (2.4) follows.

Historically, codes satisfying (2.4) have been called universal codes relative to $\mathcal{P}$;
codes satisfying (2.7) have been considered in the literature only much more recently
and are usually called ‘universal codes for individual sequences’ [77]. The two-part
code $D$ that we just defined is universal both in an individual sequence and in an
average sense: $D$ achieves code lengths within a constant of that achieved by $D_{(n,k)}$
for every individual sequence, for every $k \in \mathcal{N}$; but $D$ also achieves expected code
lengths within a constant of the Shannon-Fano code for $P$, for every $P \in \mathcal{P}$. We may
say that $D$ interpolates between Shannon’s codes, which are optimal for a specific $P$,
and Kolmogorov’s code $D^*$ (with length function $K$), which by definition does at least
as well (within an additive constant) as $D$.

Example  2.4 

Suppose our sequence is generated by independent tosses of a coin with bias $p$ of
tossing ‘heads’, where $p \in (0,1)$. Identifying ‘heads’ with 1, the probability of a particular
initial segment $x_{[1:n]}$ containing $n - n_0$ outcomes ‘1’ is then $(1-p)^{n_0} p^{n-n_0}$. Let $\mathcal{P}$ be the
set of corresponding information sources, containing one element for each $p \in (0,1)$. $\mathcal{P}$ is
an uncountable set; nevertheless, a universal code for $\mathcal{P}$ exists. In fact, it can be shown
that the code $D$ with lengths (2.8) in example 2.3 is universal for $\mathcal{P}$, i.e. it satisfies
(2.4). The reason for this is (roughly) as follows: if data are generated by a coin with
bias $p$, then with probability 1, the frequency of zeroes $n_0/n$ converges to $1 - p$, so that,
by (2.8) and (2.9), $n^{-1} L(x_{[1:n]})$ tends to $n^{-1} H(P^{(n)}) = H(p, 1-p)$.

If we are interested in practical data compression, then the assumption that the data
are generated by a biased-coin source is very restrictive. But there are much richer
classes of distributions $\mathcal{P}$ for which we can formulate universal codes. For example,
we can take $\mathcal{P}$ to be the class of all Markov sources of each order; here the probability
that $X_i = 1$ may depend on arbitrarily many earlier outcomes. Such ideas form the
basis of most data compression schemes used in practice. Codes which are universal
for the class of all Markov sources of each order and which encode and decode in
real time can easily be implemented. Thus, while we cannot find the shortest program
that generates a particular sequence, it is often possible to effectively find the shortest
encoding within a quite sophisticated class of codes. ◊
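
The convergence claimed for the biased-coin source can be illustrated with an exact computation. The Python sketch below is a hedged illustration, not the report's method: it reuses the two-part code of example 2.3 with a hypothetical idealised cost `nat_code_length` for describing a natural number, and it computes the exact expected code length per symbol under a coin with probability p of producing a 1.

```python
import math


def nat_code_length(k: int) -> float:
    """Assumed idealised cost in bits of describing the natural number k >= 0:
    log m + 2 log log m for m = k + 1."""
    log_m = math.log2(k + 1)
    return log_m + 2 * math.log2(max(log_m, 1.0))


def binary_entropy(q: float) -> float:
    """H(q, 1 - q) in bits."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)


def expected_bits_per_symbol(n: int, p: float) -> float:
    """Exact expectation of L(x_[1:n]) / n when each bit is 1 with probability p."""
    expected = 0.0
    for n0 in range(n + 1):                      # n0 = number of zeroes
        log_count = math.log2(math.comb(n, n0))  # log of the number of strings with n0 zeroes
        # log-probability that a string of length n contains exactly n0 zeroes;
        # working in the log domain avoids overflow of the binomial coefficient
        log_prob = log_count + n0 * math.log2(1 - p) + (n - n0) * math.log2(p)
        length = nat_code_length(n) + nat_code_length(n0) + math.ceil(log_count)
        expected += 2.0 ** log_prob * length
    return expected / n


for n in (200, 2000):
    print(n, round(expected_bits_per_symbol(n, 0.2), 4), round(binary_entropy(0.2), 4))
```

The gap between the expected code length per symbol and the entropy shrinks as n grows, in line with the O(log n)/n overhead suggested by (2.8).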

Expected Kolmogorov complexity = Shannon entropy  Suppose the source words $x$
are distributed as a random variable $X$ with probability $P(x)$. While $K(x)$ is fixed for
each $x$ and gives the shortest code word length (but only up to a fixed constant) and is
independent of the probability distribution $P$, we may wonder whether $K$ is also
universal in the following sense: if we weigh each individual code word length for $x$ with
its probability $P(x)$, does the resulting $P$-expected code word length $\sum_x P(x) K(x)$
achieve the minimal average code word length $H(P) = -\sum_x P(x) \log P(x)$? Here
we sum over the entire support of $P$; restricting summation to a small set, for example
the singleton set $\{x\}$, can give a different result. The reasoning above implies that,
under some mild restrictions on the distributions $P$, the answer is yes. This is expressed
in the following theorem, where, instead of the quotient, we look at the difference of
$\sum_x P(x) K(x)$ and $H(P)$. This allows us to express really small distinctions. We call
an information source $P$ recursive if there exists a Turing machine that, when given
$x \in \{0,1\}^*$ and $y, n \in \mathcal{N}$ as input, outputs $P^{(n)}(x)$ to precision $1/y$. The following
theorem can be found in [71].

Theorem  2.4 

Let $P$ be a recursive information source. Then for all $n$,

$$0 \le \sum_{x \in \{0,1\}^n} P^{(n)}(x) K(x) - H(P^{(n)}) \le c_P,$$

where $c_P$ is a constant that depends only on $P$ (and not on $n$).

The Shannon-Fano code for a computable distribution is itself computable. Therefore,
for every computable distribution $P$, the universal code $D^*$ whose length function is
the Kolmogorov complexity compresses on average at least as much as the Shannon-Fano
code for $P$. This is the intuitive reason why, no matter what computable distribution
$P$ we take, its expected Kolmogorov complexity is close to its entropy.


2.5  Mutual  information 
2.5.1  Shannon  mutual  information 

How much information can a random variable $X$ convey about a random variable
$Y$? Taking a purely combinatorial approach, this notion is captured as follows. If $X$
ranges over $S_X$ and $Y$ ranges over $S_Y$, then we look at the set $U$ of possible events
$(X = a, Y = b)$ consisting of joint occurrences of event $X = a$ and event $Y = b$.
If $U$ does not equal the Cartesian product $S_X \times S_Y$, then this means there is some
dependency between $X$ and $Y$. Considering the set $U_a = \{(a, y) : (a, y) \in U\}$
for $a \in S_X$, it is natural to define the conditional entropy of $Y$ given $X = a$ as
$H(Y|X = a) = \log d(U_a)$. This suggests immediately that the information given by
$X = a$ about $Y$ is

$$I(X = a : Y) = H(Y) - H(Y|X = a).$$

For example, if $U = \{(1,1), (1,2), (2,3)\}$, $U \subseteq S_X \times S_Y$ with $S_X = \{1,2\}$ and
$S_Y = \{1,2,3,4\}$, then $I(X = 1 : Y) = 1$ and $I(X = 2 : Y) = 2$.
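
The numbers in this small example can be reproduced directly. The sketch below is an illustration only; it assumes that the unconditional combinatorial entropy is H(Y) = log d(S_Y), which is consistent with the values quoted in the text, and the function name is ours.

```python
import math


def combinatorial_information(U, S_Y, a):
    """I(X = a : Y) = H(Y) - H(Y | X = a) in bits, with the assumed
    combinatorial entropies H(Y) = log |S_Y| and H(Y | X = a) = log |U_a|."""
    U_a = {(x, y) for (x, y) in U if x == a}
    if not U_a:
        raise ValueError("the event X = a never occurs in U")
    return math.log2(len(S_Y)) - math.log2(len(U_a))


U = {(1, 1), (1, 2), (2, 3)}
S_Y = {1, 2, 3, 4}
print(combinatorial_information(U, S_Y, 1))  # 1.0
print(combinatorial_information(U, S_Y, 2))  # 2.0
```

Knowing X = 2 pins Y down completely (one remaining candidate out of four), whereas X = 1 leaves two candidates, so it conveys one bit less.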

In this formulation it is obvious that $H(X|X = a) = 0$, and that $I(X = a : X) =
H(X)$. This approach amounts to the assumption of uniform distribution of the
probabilities concerned. We can generalize this approach, taking into account the frequencies
or probabilities of the occurrences of the different values $X$ and $Y$ can assume. Let the
joint probability $p(a, b)$ be defined as ‘the probability of the joint occurrence of event
$X = a$ and event $Y = b$’. This leads to the self-evident formulas for joint variables
$X, Y$:


$$\begin{aligned}
H(X, Y) &= -\sum_{a,b} p(a,b) \log p(a,b), \\
H(X) &= -\sum_{a,b} p(a,b) \log \sum_b p(a,b), \\
H(Y) &= -\sum_{a,b} p(a,b) \log \sum_a p(a,b),
\end{aligned}$$

where summation over $a$ is taken over all outcomes of the random variable $X$ and
summation over $b$ is taken over all outcomes of random variable $Y$. One can show that

$$H(X, Y) \le H(X) + H(Y), \tag{2.11}$$

with equality only in the case that $X$ and $Y$ are independent. In all of these equations
the entropy quantity on the left-hand side increases if we choose the probabilities on
the right-hand side more equally.


Conditional entropy  The conditional probability $p(b|a)$ of outcome $Y = b$ given
outcome $X = a$ for random variables $X$ and $Y$ (not necessarily independent) is defined
by

$$p(b|a) = \frac{p(a,b)}{\sum_b p(a,b)}.$$

This leads to the following analysis of the information in $X$ about $Y$, by first
considering the conditional entropy of $Y$ given $X$ as the average of the entropy for $Y$ for each
value of $X$ weighted by the probability of getting that particular value:

$$\begin{aligned}
H(Y|X) &= \sum_a p(a) H(Y|X = a) \\
&= -\sum_a p(a) \sum_b p(b|a) \log p(b|a) \\
&= -\sum_{a,b} p(a,b) \log p(b|a).
\end{aligned}$$

The quantity on the left-hand side tells us how uncertain we are about the outcome of
$Y$ when we know an outcome of $X$. With

$$\begin{aligned}
H(X) &= -\sum_a p(a) \log p(a) \\
&= -\sum_{a,b} p(a,b) \log \sum_b p(a,b)
\end{aligned}$$

and substituting the formula for $p(b|a)$, we find $H(Y|X) = H(X,Y) - H(X)$.
Rewrite this expression as the entropy equality

$$H(X, Y) = H(X) + H(Y|X). \tag{2.12}$$

This can be interpreted as ‘the uncertainty of the joint event $(X, Y)$ is the
uncertainty of $X$ plus the uncertainty of $Y$ given $X$’. Combining (2.11) and (2.12) gives
$H(Y) \ge H(Y|X)$, which can be taken to imply that, on average, knowledge of $X$ can never
increase uncertainty of $Y$. In fact, uncertainty in $Y$ will be decreased unless $X$ and $Y$
are independent. Finally, the information in the outcome $X = a$ about $Y$ is defined as

$$I(X = a : Y) = H(Y) - H(Y|X = a). \tag{2.13}$$

Here the quantities $H(Y)$ and $H(Y|X = a)$ on the right-hand side of the equations are
always equal to or less than the corresponding quantities under the uniform distribution
we analyzed first. The values of the quantities $I(X = a : Y)$ under the assumption of
uniform distribution of $Y$ and $Y|X = a$ versus any other distribution are not related
by inequality in a particular direction. The equalities $H(X|X = a) = 0$ and $I(X =
a : X) = H(X)$ hold under any distribution of the variables. Since $I(X = a : Y)$ is
a function of outcomes of $X$, while $I(Y = b : X)$ is a function of outcomes of $Y$, we
do not compare them directly. However, forming the expectation defined as

$$\begin{aligned}
E(I(X = a : Y)) &= \sum_a p(a) I(X = a : Y), \\
E(I(Y = b : X)) &= \sum_b p(b) I(Y = b : X),
\end{aligned}$$

and combining (2.12) and (2.13), we see that the resulting quantities are equal.
Denoting this quantity by $I(X;Y)$ and calling it the mutual information in $X$ and $Y$, we see
that this information is symmetric:

$$I(X;Y) = E(I(X = a : Y)) = E(I(Y = b : X)). \tag{2.14}$$
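
The identities (2.11), (2.12) and (2.14) are easy to check numerically for any finite joint distribution. The following Python sketch does so for a small made-up distribution; the numbers in `joint` and all function names are illustrative assumptions, not taken from the report.

```python
import math
from collections import defaultdict


def marginals(joint):
    """Return the marginal distributions of the first and second coordinate."""
    px, py = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        px[a] += p
        py[b] += p
    return dict(px), dict(py)


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def conditional_entropy(joint):
    """H(second | first) = -sum_{a,b} p(a,b) log p(b|a)."""
    px, _ = marginals(joint)
    return -sum(p * math.log2(p / px[a]) for (a, b), p in joint.items() if p > 0)


def mutual_information(joint):
    """I(X;Y) = sum_{a,b} p(a,b) log [p(a,b) / (p(a) p(b))]."""
    px, py = marginals(joint)
    return sum(p * math.log2(p / (px[a] * py[b]))
               for (a, b), p in joint.items() if p > 0)


# Hypothetical joint distribution p(a, b) of two dependent binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px, py = marginals(joint)
swapped = {(b, a): p for (a, b), p in joint.items()}

print("H(X,Y)          =", entropy(joint))
print("H(X) + H(Y|X)   =", entropy(px) + conditional_entropy(joint))
print("H(Y) - H(Y|X)   =", entropy(py) - conditional_entropy(joint))
print("H(X) - H(X|Y)   =", entropy(px) - conditional_entropy(swapped))
```

The last two printed values agree, which is the symmetry expressed by (2.14).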


Example  2.5 

Suppose we want to exchange the information about the outcome $X = x$ and it is
known already that outcome $Y = y$ is the case. Then we require (using the Shannon-Fano
code) about $-\log P(X = x|Y = y)$ bits to communicate $x$. On average, over the
joint distribution $P(X = x, Y = y)$ we use $H(X|Y)$ bits, which is optimal by
Shannon’s noiseless coding theorem. In fact, exploiting the mutual information paradigm,
the expected information that outcome $Y = y$ gives about outcome $X = x$ is the same
as the expected information that $X = x$ gives about $Y = y$. ◊

Interpretation in terms of sequential questions  Just as we did for the entropy, we
can also re-interpret mutual information in terms of protocols for asking questions.
Suppose that B sequentially asks questions about $X$, but, as in example 2.5, before
he has to ask any questions, B is told that $Y = y$. B then sequentially asks questions
to find out the value of $X$, using the protocol defined by the Shannon-Fano code for
$P(X = \cdot \,|\, Y = y)$. By Shannon’s noiseless coding theorem, this is the optimal protocol.
Intuitively, since B is given some initial information, we expect that B has to ask fewer
questions than if he were not given any initial information. $I(Y;X)$ denotes exactly
how many fewer questions B can expect to need to ask on average if he is already told
the value of $Y$ before asking any questions. Here the average is over both $X$ and $Y$.
Indeed, on average, B needs to ask fewer questions, since $I(Y;X) \ge 0$. But there may
certainly exist individual $y$ such that $I(Y = y : X)$ is negative. For example, we may
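have $\mathcal{X} = \{0,1\}$, $\mathcal{Y} = \{0,1\}$, $P(X = 1|Y = 0) = 1$, $P(X = 1|Y = 1) = \frac{1}{2}$, $P(Y =
1) = \epsilon$. Then $H(X) = H(\epsilon/2, 1 - \epsilon/2)$, so that $I(Y;X) = H(\epsilon/2, 1 - \epsilon/2) - \epsilon$ whereas
$I(Y = 1 : X) = H(\epsilon/2, 1 - \epsilon/2) - 1$. For every $\epsilon \in (0,1)$, the latter quantity is smaller than 0.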
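
The sign pattern in this example can be verified by evaluating the quantities directly from definition (2.13). The sketch below is an illustration under the assumption that P(X = 1 | Y = 1) = 1/2, so that learning Y = 1 leaves X maximally uncertain; the function names are ours.

```python
import math


def h2(q: float) -> float:
    """Binary entropy H(q, 1 - q) in bits."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)


def information_about_x(eps: float):
    """Return (I(Y=0 : X), I(Y=1 : X), I(Y;X)) for the two-point example with
    P(X=1|Y=0) = 1, P(X=1|Y=1) = 1/2 (assumed) and P(Y=1) = eps."""
    p_x1 = (1 - eps) * 1.0 + eps * 0.5     # marginal P(X = 1)
    H_X = h2(p_x1)
    info_y0 = H_X - h2(1.0)                # H(X | Y = 0) = 0
    info_y1 = H_X - h2(0.5)                # H(X | Y = 1) = 1
    average = (1 - eps) * info_y0 + eps * info_y1
    return info_y0, info_y1, average


for eps in (0.01, 0.1, 0.5):
    print(eps, information_about_x(eps))
```

For each value of eps the outcome Y = 1 carries negative information about X, while the average over both outcomes of Y stays nonnegative.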

Problem and lacuna  The quantity $I(X;Y)$ symmetrically characterizes to what
extent random variables $X$ and $Y$ are correlated. An inherent problem with probabilistic
definitions is that - as we have just seen - although the expectation $E(I(Y = y : X))$ is
never negative, for some probability distributions and some $y$, $I(Y = y : X)$ can turn out
to be negative - which definitely contradicts our naive notion of information content. How is
this possible? The concept of information as used in the theory of communication is
a probabilistic notion, which is natural for information transmission over
communication channels. Nonetheless, we tend to identify probabilities of messages with
frequencies of messages in a sufficiently long sequence, which under some conditions on
the stochastic source can be rigorously justified. The great probabilist, Kolmogorov,
remarks, ‘If something goes wrong here, the problem lies in the vagueness of our ideas
of the relation between mathematical probability theory and real random events in
general’. The algorithmic mutual information we introduce below can never be negative,
and in this sense is closer to the intuitive notion of information content.


2.5.2  Algorithmic  mutual  information 

Conditional Kolmogorov complexity  To prepare for the definition of Shannon
mutual information, we first needed to introduce a conditional version of entropy.
Analogously, to prepare for the definition of algorithmic mutual information, we need a
notion of conditional Kolmogorov complexity. Intuitively, the conditional prefix
Kolmogorov complexity $K(x|y)$ of $x$ given $y$ can be interpreted as the length of the shortest
prefix program $p$ such that, when $y$ is given to the program $p$ as input, the program prints $x$
and then halts. The idea of providing $p$ with an input $y$ is realized by putting $\langle p, y \rangle$
rather than just $p$ on the input tape of the universal prefix machine $U$.

Definition  2.4 

The conditional prefix Kolmogorov complexity of $x$ given $y$ (for free) is

$$K(x|y) = \min_p \{ l(p) : U(\langle p, y \rangle) = x,\ p \in \{0,1\}^* \}.$$

We define $K(x) = K(x|\epsilon)$.

Note that we just redefined $K(x)$ so that the unconditional Kolmogorov complexity is
exactly equal to the conditional Kolmogorov complexity with empty input. This does
not contradict our earlier definition: we can choose a reference prefix machine $U$ such
that $U(\langle p, \epsilon \rangle) = U(p)$. Then we automatically have $K(x) = K(x|\epsilon)$.

Recall from section 2.2 the notation $\stackrel{+}{=}$, $\stackrel{+}{>}$ (equality and inequality up to an
additive constant). By definition, $K(x, y) = K(\langle x, y \rangle)$. Trivially, the symmetry property
holds: $K(x, y) \stackrel{+}{=} K(y, x)$. An interesting property is the ‘additivity of complexity’ property

$$K(x, y) \stackrel{+}{=} K(x) + K(y|x^*) \stackrel{+}{=} K(y) + K(x|y^*), \tag{2.15}$$

where $x^*$ is the first (in standard enumeration order) shortest prefix program that
generates $x$ and then halts. It is easy to see that $x^*$ has the same information as the pair $x$,
$K(x)$: given $x^*$ we can compute $x$ and $l(x^*) = K(x)$; given $x$, $K(x)$ we can run all
programs simultaneously in dovetailed fashion and select the first program of length
$K(x)$ that halts with output $x$ as $x^*$. (Dovetailed fashion means that in phase $k$ of the
process we run all programs $i$ for $j$ steps such that $i + j = k$, $k = 1, 2, \ldots$.) Equation
(2.15) is the Kolmogorov complexity equivalent of the entropy equality (2.12). That
this latter equality holds is true by simply rewriting both sides of the equation
according to the definitions of averages of joint and marginal probabilities. In fact, potential
individual differences are averaged out. But in the Kolmogorov complexity case we do
nothing like that: it is truly remarkable that additivity of algorithmic information holds
for individual objects.
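
The dovetailing schedule described in the paragraph above can be made concrete with a toy model. In the Python sketch below, programs are modelled as generators that yield once per computation step; `toy_program` is a purely hypothetical stand-in for an enumeration of real programs, and we assume that 'running program i for j steps' means advancing it by j further steps.

```python
from typing import Callable, Generator, List, Tuple

Program = Generator[None, None, str]


def toy_program(i: int) -> Program:
    """Hypothetical stand-in for program number i: even-numbered programs never
    halt, odd-numbered ones halt after i * i steps with an output string."""
    if i % 2 == 0:
        while True:
            yield            # one computation step, forever
    for _ in range(i * i):
        yield                # one computation step
    return f"output-{i}"


def dovetail(make_program: Callable[[int], Program], phases: int) -> List[Tuple[int, str]]:
    """Run phases 2..phases; in phase k, program i is advanced by j = k - i steps.

    Returns (program index, output) pairs in the order in which programs halt.
    """
    running = {}                      # index -> generator that has not halted yet
    halted: List[Tuple[int, str]] = []
    finished = set()
    for k in range(2, phases + 1):
        for i in range(1, k):         # all i, j >= 1 with i + j = k
            if i in finished:
                continue
            if i not in running:
                running[i] = make_program(i)
            for _ in range(k - i):
                try:
                    next(running[i])
                except StopIteration as stop:
                    halted.append((i, stop.value))
                    finished.add(i)
                    del running[i]
                    break
    return halted


print(dovetail(toy_program, phases=12))
```

The non-halting programs never block the halting ones: every program that halts is eventually discovered, which is all that the selection of $x^*$ requires.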

The result (2.15) is due to Gács [36], can be found as theorem 3.9.1 in [71] and has a
difficult proof. It is perhaps instructive to point out that the version with just $x$ and $y$
in the conditionals doesn’t hold with $\stackrel{+}{=}$, but holds up to additive logarithmic terms that
cannot be eliminated.

To define the algorithmic mutual information between two individual objects $x$ and $y$
with no probabilities involved, it is instructive to first recall the probabilistic notion
(2.14). Rewriting (2.14) as

$$I(X;Y) = \sum_x \sum_y p(x,y) \left( -\log p(x) - \log p(y) + \log p(x,y) \right),$$

and noting that $-\log p(s)$ is very close to the length of the prefix-free Shannon-Fano
code for $s$, we are led to the following definition. The information in $y$ about $x$ is
defined as

$$I(y : x) = K(x) - K(x|y^*) \stackrel{+}{=} K(x) + K(y) - K(x, y), \tag{2.16}$$

where the second equality is a consequence of (2.15) and states that this information
is symmetrical, $I(x : y) \stackrel{+}{=} I(y : x)$, and therefore we can talk about mutual
information*. Theorem 2.4 gave the relationship between entropy and ordinary Kolmogorov
complexity; it showed that the entropy of distribution $P$ is approximately equal to the
expected (under $P$) Kolmogorov complexity. Theorem 2.5 gives the analogous result
for the mutual information (to facilitate comparison to theorem 2.4, note that $x$ and $y$
in (2.17) below may stand for strings of arbitrary length $n$).

Theorem  2.5 

Given a recursive probability mass distribution $p(x, y)$ over $(x, y)$ we have

$$I(X;Y) - K(p) \stackrel{+}{<} \sum_x \sum_y p(x,y) I(x : y) \stackrel{+}{<} I(X;Y) + 2K(p), \tag{2.17}$$

where $K(p)$, which depends only on $p$, is the length of the shortest
prefix-free program that computes $p(x, y)$ from input $(x, y)$.

Thus, we see that the expectation of the algorithmic mutual information $I(x : y)$ is
close to the probabilistic mutual information $I(X;Y)$.

Interpretation in terms of sequential questions  The algorithmic mutual information
$I(y : x) = K(x) - K(x|y^*)$, which equals $K(x) - K(x|y)$ up to an additive logarithmic
term $O(\log K(y))$, is the savings in number of questions B needs to ask to get to know
$x$ if B already knows $y$. Clearly, if $y$ is the empty word, no information at all, then
B needs to ask $K(x)$ yes-no questions to obtain the consecutive bits of $x^*$. But if B
already knows $y$ then he needs to ask only $K(x|y)$ such questions to obtain the shortest
program to compute from $y$ to $x$. The caveat being, as usual, that B has arbitrary
amounts of time and storage to perform its computation from $y$ to $x$. For specific
individual $x$, $y$ this number can be far less than the average as given by Shannon’s
mutual information.

* The notation of the algorithmic (individual) notion $I(x : y)$ distinguishes it from the
probabilistic (average) notion $I(X;Y)$. We deviate slightly from [71] where $I(y : x)$ is defined as
$K(x) - K(x|y)$.

Problem and lacuna  Entropy, Kolmogorov complexity and mutual (algorithmic)
information are concepts that do not distinguish between different kinds of information
(such as ‘meaningful’ and ‘meaningless’ information). Such more refined notions can
be arrived at by constraining the description methods with which strings are allowed
to be encoded, and by considering lossy rather than lossless encoding. Yet the
basic notions entropy, Kolmogorov complexity and mutual information continue to play
a fundamental role. The two most important developments are rate-distortion theory
in the Shannon setting ([102], [20]), dealing with ‘useful’ information, and the
Kolmogorov structure function in Kolmogorov’s setting, dealing with ‘meaningful’
information ([66], [103], [20], [37], [116], [119], [98]). It is here that the two theories
may have something relevant to say about the notions of ‘information’ that are
studied within the logic and semantics of natural language communities [114]. We briefly
illustrate this for the rate-distortion theory.


2.6  Shannon’s rate distortion: information in questions

As before, we consider a situation in which sender A wants to communicate the
outcome of random variable $X$ to receiver B. The distribution of $X$ is known to both A
and B. But now A is only allowed to use a finite number, say $R$ bits, to
communicate, so that A can only send $2^R$ different messages. Then the encoding function $E$
has to map $\mathcal{X}$ to $\{0,1\}^R$, and $D$ has to map $\{0,1\}^R$ back to $\mathcal{X}$. If $|\mathcal{X}| > 2^R$ or if $\mathcal{X}$
is uncountable (say, $\mathcal{X} = \mathbb{R}$), then there can be no code $(D, E)$ such that for all $x$,
$D(E(x)) = x$. Thus, A and B cannot make sure that $x$ can always be reconstructed.
As the next best thing, they may agree on a code such that for all $x$, $D(E(x))$ is in
some sense ‘as close as possible’ to the original $x$. To formalize this for a given code
$(D, E)$, we define $\hat{X} : \mathcal{X} \to \mathcal{X}$ as the function $\hat{X}(x) := D(E(x))$, and we let $\hat{\mathcal{X}}$
be the range of $\hat{X}$. We may interpret $\hat{X}(x)$ as an estimate of $x$, and $\hat{\mathcal{X}}$ as the set of
values it can take. We assume that the ‘goodness’ of $\hat{X}(x)$ as an approximation of $x$
is measured using some distortion function $d : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}$. This distortion function
may be anything that is appropriate to the situation at hand. Once $d$ is fixed, we may
consider the expected distortion

$$E(d(X, \hat{X})) = \sum_x p_x\, d(x, \hat{X}(x)), \tag{2.18}$$

where, if $\mathcal{X} = \mathbb{R}$, the sum is replaced by an integral and $p_x$ stands for the probability
density of $x$ with respect to Lebesgue measure.

In the rate distortion setting, the goal of A and B is to determine the code $(D, E)$ with
associated $\hat{X}$ that minimizes the expected distortion.

Example  2.6 

Suppose $X$ is a real-valued, normally (Gaussian) distributed random variable with
mean $E(X) = 0$ and variance $E(X - E(X))^2 = \sigma^2$. Let us use the squared Euclidean
distance $d(x, \hat{x}) = (x - \hat{x})^2$ as a distortion measure. If A is allowed to use $R$ bits,
then $\hat{\mathcal{X}}$ can have no more than $2^R$ elements, whereas $\mathcal{X} = \mathbb{R}$ is uncountably infinite.
We should choose $\hat{\mathcal{X}}$ and the function $\hat{X}$ such that (2.18) is minimized. Suppose first
$R = 1$. Then the optimal $\hat{X}$ turns out to be

$$\hat{X}(x) = \begin{cases} \sqrt{\frac{2}{\pi}\sigma^2} & \text{if } x \ge 0, \\ -\sqrt{\frac{2}{\pi}\sigma^2} & \text{if } x < 0. \end{cases}$$

Thus, the domain $\mathcal{X}$ is partitioned into two regions, one corresponding to $x \ge 0$, and
one to $x < 0$. That the boundary should be at $x = 0$ is evident by the symmetry of the
Gaussian distribution around 0. Within each region, one then picks a ‘representative
point’ so as to minimize (2.18). Similarly, if $R = 2$, then $\mathcal{X}$ should be partitioned into
4 regions, each of which is to be represented by a single point such that (2.18) is
minimized. An extreme case is $R = 0$: how should B estimate $X$ if it is not given any
information whatsoever? This means that $\hat{X}(x)$ must take the same value for all $x$. The
expected distortion (2.18) is then minimized if B picks $\hat{X} \equiv 0$, giving distortion equal
to $\sigma^2$. ◊
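
The one-bit case of this example can be checked by numerical integration. The Python sketch below is an illustration only: it evaluates the expected distortion (2.18) with a simple midpoint rule (the grid size and integration width are arbitrary choices of ours) and compares it with the closed form $\sigma^2(1 - 2/\pi)$, which follows from expanding the square but is not stated in the text.

```python
import math


def expected_distortion_one_bit(c: float, sigma: float,
                                grid: int = 20000, width: float = 10.0) -> float:
    """E[(X - Xhat(X))^2] for X ~ N(0, sigma^2), where Xhat(x) = +c for x >= 0
    and -c for x < 0, via a midpoint rule on [-width * sigma, width * sigma]."""
    lo, hi = -width * sigma, width * sigma
    dx = (hi - lo) / grid
    norm = sigma * math.sqrt(2 * math.pi)
    total = 0.0
    for i in range(grid):
        x = lo + (i + 0.5) * dx
        density = math.exp(-x * x / (2 * sigma * sigma)) / norm
        x_hat = c if x >= 0 else -c
        total += density * (x - x_hat) ** 2 * dx
    return total


sigma = 2.0
c_star = math.sqrt(2 / math.pi * sigma ** 2)      # representative point from the text
print("R = 1, c = c*     :", expected_distortion_one_bit(c_star, sigma))
print("R = 1, c = 0.9 c* :", expected_distortion_one_bit(0.9 * c_star, sigma))
print("R = 1, c = 1.1 c* :", expected_distortion_one_bit(1.1 * c_star, sigma))
print("R = 0, Xhat = 0   :", expected_distortion_one_bit(0.0, sigma))
```

Moving the representative points away from $\pm\sqrt{2\sigma^2/\pi}$ in either direction increases the distortion, and setting them to 0 reproduces the R = 0 distortion $\sigma^2$.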


There  is  no  reason  in  general  that  the  distortion  function  should  be  symmetric:  in  fact,  it 
may  be  anything  that  pertains  to  the  situation  at  hand.  It  can  be  considered  as  (minus)  a 
utility  function,  indicating  the  loss  that  B  incurs  if  he  has  to  predict  x  without  knowing 
its  precise  value. 

Interpretation in terms of sequential questions  Previously, we interpreted entropy
as the expected minimum number of yes/no-questions that the receiver has to ask the sender
in order to determine the precise outcome $x$ of a random variable $X$.

The present setting can be interpreted in terms of a more involved question-and-answer
game: now the receiver is allowed to ask only $R$ yes/no-questions. He then has to come up
with a guess $\hat{x}$ of the outcome $x$. The quality of this guess is measured by $d(x, \hat{x})$. The
goal of the receiver is now to ask the $R$ ‘cleverest possible questions’ that reduce his
expected distortion as much as possible; equivalently, they increase his expected utility
as much as possible. Thus, there is a relation to ‘quality and quantity of information
exchange’ [114] as studied in natural language semantics.

As  a  concrete  case,  if  R  =  1,  then  in  the  Gaussian  example  above,  receiver  should 
ask ‘Is $x \in [0, \infty)$ or not?’. Every other question reduces the expected distortion by a
lesser  amount.  In  general,  the  present  question-and-answer  game  is  very  different  from 
the  original  game  where  the  goal  was  to  minimize  the  total  number  of  questions.  But 
the  following  example  shows  that,  if  we  take  a  special  distortion  measure,  then  the  goal 
of  minimizing  distortion  and  minimizing  total  number  of  questions  are  reconciled. 

Example  2.7 

Suppose receiver wants to estimate the actual x by a probability distribution P on $\mathcal{X}$.
Thus, if R bits are allowed to be used, one of $2^R$ different distributions on $\mathcal{X}$ can be sent
to receiver. The best that can be done is to partition $\mathcal{X}$ into $2^R$ subsets $A_1, \ldots, A_{2^R}$.
Sender observes the i such that $x \in A_i$ and passes this information on to receiver. A
little thought reveals that the information i tells the receiver that X is now distributed
according to the conditional distribution $P(X = \cdot \mid X \in A_i)$. It is then natural to
measure the quality of distribution $P(X = \cdot \mid X \in A_i)$ by its entropy, i.e. by the
additional number of questions that receiver has to ask before he knows the value of
x with certainty. That is, we take $d(x, P) = -\log P(x)$: the distortion function is
the Shannon-Fano code length for the communicated distribution. Here we implicitly
generalized the definition of ‘distortion’ measure: we no longer require the estimates
$\hat{x}$ to take values in $\mathcal{X}$. Rather, the set of estimates is now a set of probability
distributions on $\mathcal{X}$; the new definition includes the former as a special case.

With $d(x, P) = -\log P(x)$, the expected distortion is $E[d(X, P)] = H(P)$. The
minimum achievable distortion $d^*(r)$ for R = r is given by

$$d^*(r) = \min H(X \mid Y),$$

where $\mathcal{Y} = \{1, \ldots, 2^r\}$, and the minimum is over all distributions $P^*$
over $\mathcal{X} \times \mathcal{Y}$ whose marginal on $\mathcal{X}$ is P, i.e. such that for all $x \in \mathcal{X}$,
$\sum_{y \in \mathcal{Y}} P^*(Y = y, X = x) = P(X = x)$. In particular, for r = 0, $d^*(r) = H(P)$; for r > H(P), $d^*(r) = 0$; for general r,
$d^*(r)$ is the minimum expected number of questions that B still has to ask to determine
x, just after B has already been given the answers to the first r questions.

Thus, if we pick the Shannon-Fano code length as the distortion measure, then the
rate-distortion theory is reconciled with the lossless compression theory. In this case, the
distortion-rate function $d^*(r)$ shows how fast the entropy decreases (the information
gained by receiver increases) if receiver always asks the ‘cleverest possible question’,
that has the highest expected information gain. ◊
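For a small alphabet, the expected number of questions that remain after r answers can be found by brute force. The sketch below is only an illustration: the four-outcome distribution `p` is made up, the function names are ours, and the search over all labelings is exponential in the alphabet size, so it is not meant as a practical method.

```python
import math
from itertools import product

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

def min_remaining_entropy(p, r):
    """Smallest expected remaining entropy over all partitions of the
    outcomes into at most 2**r cells, found by trying every labeling."""
    best = float("inf")
    for labels in product(range(2 ** r), repeat=len(p)):
        total = 0.0
        for cell in range(2 ** r):
            members = [q for q, lab in zip(p, labels) if lab == cell]
            mass = sum(members)
            if mass > 0:
                # Entropy of the conditional distribution within this cell,
                # weighted by the probability of landing in the cell.
                total += mass * entropy([q / mass for q in members])
        best = min(best, total)
    return best

p = [0.5, 0.25, 0.125, 0.125]   # hypothetical source distribution, H(P) = 1.75 bits
for r in range(3):
    print(r, min_remaining_entropy(p, r))
```

For this particular `p`, zero questions leave the full entropy H(P), one well-chosen question (‘is it the first outcome?’) removes exactly one bit, and two questions identify the outcome completely.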

Rate distortion and mutual information. As R increases, the minimum achievable
distortion becomes smaller and smaller. Shannon was interested in studying the
functional relationship between R and the minimum achievable distortion $d^*$ for a given R.
This is called the distortion-rate function. For technical reasons it is often more
convenient to study R as a function of $d^*$. This is the celebrated rate-distortion function.
As one of the main results in his original paper, Shannon [102] showed that there
is a deep connection between the mutual information and the rate-distortion
function which holds no matter what distortion function d is used - thus not only for the
Shannon-Fano distortion. We only mention this result because it illustrates that mutual
information is a fundamental notion; for a precise statement we refer to [20].


2.7  Discussion 

Of the three most important developments in Shannon’s original paper, we only
discussed two: the noiseless coding theorem for lossless compression (theorem 2.3) and
the notion of rate-distortion related to lossy compression. We did not discuss the
channel coding theorem, which is related to lossless communication over a noisy channel.
These and many other topics in Shannon information theory are thoroughly discussed
and explained in the standard reference [20].

Kolmogorov complexity has many applications which we could not discuss here. It
leads to a formal notion of randomness of individual sequences that does not refer to
an underlying probability distribution. Also, it lies at the basis of a powerful
mathematical theory of inductive inference. Third, it has led to a new mathematical proof
technique called the incompressibility method. These and many other topics in
Kolmogorov complexity are thoroughly discussed and explained in the standard reference
[71]. We end by mentioning a recent development: the Kolmogorov structure function.

The Kolmogorov structure function ([66], [103], [20], [37], [116], [119], [98]) can be
viewed (to some extent) as the analogue in Kolmogorov’s theory of Shannon’s rate
distortion. It is based on encoding objects (strings) in two parts: a structural and a
random part. We encountered a very simple example of such a description in example 2.3,
where we first encoded the frequency of ones in a string (a very simple ‘structure’) and
then the particular sequence with the given frequency (corresponding to the ‘random’
part of the description). Intuitively, the ‘meaning’ of the string resides in the
structural part and the size of the structural part quantifies the ‘meaningful’ information
in the message. Recently, there have been many new results in this area [116].
Kolmogorov’s structure function is closely related to J. Rissanen’s minimum description
length (MDL) principle for inductive inference. In its simplest guise, this says that the
best theory for a given set of data is the theory that minimizes the description length
of the theory plus the description length of the data given the theory. Thus, data is
encoded by first encoding a theory (constituting the ‘structural’ part of the data) and
then encoding the data using the properties of the data that are prescribed by the
theory. Picking the theory minimizing the total description length leads to an automatic
trade-off between complexity of the chosen theory and its goodness of fit on the data.
This provides a practical and successful principle of inductive inference that may be
viewed as a mathematical formalization of ‘Occam’s razor’*. But that is quite another
story - we refer to [47] and [93] for details. In the next chapter we give an introduction
to the minimum description length principle in an entirely non-technical way.


* In short, this means that ‘the simplest explanation is the best’.

3. Introducing MDL


3.1  Introduction  and  overview 

How does one decide among competing explanations of data given limited
observations? This is the problem of model selection. It stands out as one of the most
important problems of inductive and statistical inference. The minimum description length
(MDL) principle is a relatively recent method for inductive inference that provides a
generic solution to the model selection problem. MDL is based on the following
insight: any regularity in the data can be used to compress the data, i.e. to describe it
using fewer symbols than the number of symbols needed to describe the data
literally. The more regularities there are, the more the data can be compressed. Equating
‘learning’ with ‘finding regularity’, we can therefore say that the more we are able
to compress the data, the more we have learned about the data. Formalizing this idea
leads to a general theory of inductive inference with several attractive properties:

1. Occam’s razor.* MDL chooses a model that trades off goodness-of-fit on the
observed data against the ‘complexity’ or ‘richness’ of the model. As such, MDL
embodies a form of Occam’s razor, a principle that is both intuitively appealing
and informally applied throughout all the sciences.

2. No overfitting, automatically. MDL procedures automatically and inherently
protect against overfitting and can be used to estimate both the
parameters and the structure (e.g., number of parameters) of a model. In contrast, to
avoid overfitting when estimating the structure of a model, traditional methods
such as maximum likelihood must be modified and extended with additional,
typically ad hoc principles.

3. Bayesian interpretation. MDL is closely related to Bayesian inference, but
avoids some of the interpretation difficulties of the Bayesian approach,
especially in the realistic case when it is known a priori to the modeler that none
of the models under consideration is completely true. In fact:

4. No need for ‘underlying truth’. In contrast to other statistical methods, MDL
procedures have a clear interpretation independent of the existence of some
underlying ‘true’ model.

5. Predictive interpretation. Because data compression is formally equivalent to
a form of probabilistic prediction, MDL methods can be interpreted as
searching for a model with good predictive performance on unseen data.

In this chapter, we introduce the MDL principle in an entirely non-technical way,
concentrating on its most important applications, model selection and avoidance of
overfitting. In section 3.2 we discuss the relation between learning and data compression.
Section 3.3 introduces model selection and outlines a first, ‘crude’ version of MDL
that can be applied to model selection. Section 3.4 indicates how these ‘crude’ ideas
need to be refined to tackle small sample sizes and differences in model complexity
between models with the same number of parameters. Section 3.5 discusses the
philosophy underlying MDL, and section 3.6 considers its relation to Occam’s razor. Section
3.7 briefly discusses the history of MDL. All this is summarized in section 3.8.

* Occam’s razor principle: ‘Entities should not be multiplied beyond necessity.’ Arguably,
this serves as a reason to refute David Bohm’s [11], [12] alternative deterministic quantum
theory, where additional entities like ‘hidden’ variables and a ‘quantum potential’ are introduced.
† See section 4.8.2, example 4.20.


3.2  The  fundamental  idea:  learning  as  data  compression 

We  are  interested  in  developing  a  method  for  learning  the  laws  and  regularities  in  data. 
The  following  example  will  illustrate  what  we  mean  by  this  and  give  a  first  idea  of  how 
it  can  be  related  to  descriptions  of  data. 

Regularity  ...  Consider  the  following  three  sequences.  We  assume  that  each  sequence 
is  10000  bits  long,  and  we  just  list  the  beginning  and  the  end  of  each  sequence: 

00010001000100010001 ... 0001000100010001000100010001, (3.1)

01110100110100100110 ... 1010111010111011000101100010, (3.2)

00011000001010100000 ... 0010001000010000001000110000. (3.3)

The first of these three sequences is a 2500-fold repetition of 0001. Intuitively, the
sequence looks regular; there seems to be a simple ‘law’ underlying it; it might make
sense to conjecture that future data will also be subject to this law, and to predict that
future data will behave according to this law. The second sequence has been
generated by tosses of a fair coin. It is intuitively speaking as ‘random as possible’, and in
this sense there is no regularity underlying it. Indeed, we cannot seem to find such a
regularity either when we look at the data. The third sequence contains approximately
four times as many 0s as 1s. It looks less regular, more random than the first; but it
looks less random than the second. There is still some discernible regularity in these
data, but of a statistical rather than of a deterministic kind. Again, noticing that such
a regularity is there and predicting that future data will behave according to the same
regularity seems sensible.

... and Compression We claimed that any regularity detected in the data can be used
to compress the data, i.e. to describe it in a short manner. Descriptions are always
relative to some description method which maps descriptions $D'$ in a unique manner to
data sets D. A particularly versatile description method is a general-purpose computer
language like C or Pascal. A description of D is then any computer program that prints
D and then halts. Let us see whether our claim works for the three sequences above.
Using a language similar to Pascal, we can write a program

for  i  =  1  to  2500;  print  ‘0001’;  next;  halt, 

which  prints  sequence  (3.1)  but  is  clearly  a  lot  shorter.  Thus,  sequence  (3.1)  is  indeed 
highly  compressible.  On  the  other  hand,  we  show  in  section  4.1,  that  if  one  generates 
a  sequence  like  (3.2)  by  tosses  of  a  fair  coin,  then  with  extremely  high  probability,  the 
shortest  program  that  prints  (3.2)  and  then  halts  will  look  something  like  this: 

print ‘01110100110100100110 ... 1010111010111011000101100010’; halt.

This program’s size is about equal to the length of the sequence. Clearly, it does nothing
more than repeat the sequence.

The third sequence lies in between the first two: generalizing n = 10000 to arbitrary
length n, we show in section 4.1 that the first sequence can be compressed to $O(\log n)$
bits; with overwhelming probability, the second sequence cannot be compressed at all;
and the third sequence can be compressed to some length $\alpha n$, with $0 < \alpha < 1$.
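These three regimes can be observed with off-the-shelf tools. The sketch below is an illustration under our own assumptions: it generates stand-ins for sequences (3.1) to (3.3) with a seeded pseudo-random generator, uses the general-purpose compressor `zlib` as a rough substitute for ‘a short program’, and also evaluates a simple two-part code that first states the number of ones and then the index of the sequence among all sequences with that many ones. Neither method yields the shortest possible description.

```python
import math
import random
import zlib

N = 10000
rng = random.Random(0)  # fixed seed so that the run is reproducible

regular = "0001" * (N // 4)                                             # like (3.1)
coin = "".join(rng.choice("01") for _ in range(N))                      # like (3.2)
biased = "".join("1" if rng.random() < 0.2 else "0" for _ in range(N))  # like (3.3)

def zlib_bits(bits):
    """Size in bits of the zlib-compressed sequence (bits packed 8 per byte)."""
    packed = int(bits, 2).to_bytes(len(bits) // 8, "big")
    return 8 * len(zlib.compress(packed, 9))

def two_part_bits(bits):
    """Bits needed to state the number of ones, plus the index of the
    sequence among all sequences of the same length with that many ones."""
    n, k = len(bits), bits.count("1")
    return math.log2(n + 1) + math.log2(math.comb(n, k))

for name, seq in [("regular", regular), ("coin", coin), ("biased", biased)]:
    print(name, "zlib:", zlib_bits(seq), "two-part:", round(two_part_bits(seq)))
```

The periodic sequence shrinks to a few hundred bits under `zlib`, the coin-toss sequence does not shrink under either method, and the biased sequence is compressed by the two-part code to a fraction of its length. The two-part code does not detect the periodicity of the first sequence at all, which shows that what can be compressed depends on the description method.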

Example  3.1  (Compressing  various  regular  sequences) 

The  regularities  underlying  sequences  (3.1)  and  (3.3)  were  of  a  very  particular  kind. 
To  illustrate  that  any  type  of  regularity  in  a  sequence  may  be  exploited  to  compress 
that  sequence,  we  give  a  few  more  examples: 

The number π Evidently, there exists a computer program for generating the first
n digits of π - such a program could be based, for example, on an infinite
series expansion of π. This computer program has constant size, except for the
specification of n which takes no more than $O(\log n)$ bits. Thus, when n is
very large, the size of the program generating the first n digits of π will be
very small compared to n: the π-digit sequence is deterministic, and therefore
extremely regular.

Physics  data  Consider  a  two-column  table  where  the  first  column  contains  numbers 
representing  various  heights  from  which  an  object  was  dropped.  The  second 
column  contains  the  corresponding  times  it  took  for  the  object  to  reach  the 
ground.  Assume  both  heights  and  times  are  recorded  to  some  finite  precision. 
In  section  3.3  we  illustrate  that  such  a  table  can  be  substantially  compressed 
by  first  describing  the  coefficients  of  the  second-degree  polynomial  H  that 
expresses  Newton’s  law;  then  describing  the  heights;  and  then  describing  the 
deviation  of  the  time  points  from  the  numbers  predicted  by  H. 

Natural language Most sequences of words are not valid sentences according to the
English language. This fact can be exploited to substantially compress English
text, as long as it is syntactically mostly correct: by first describing a grammar
for English, and then describing an English text D with the help of that
grammar [43], D can be described using far fewer bits than are needed without the
assumption that word order is constrained. ◊
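To make the first of these examples concrete, here is a short Python program that produces the decimal digits of π. It is a sketch based on Gibbons’ streaming (‘spigot’) algorithm, chosen here for brevity; the function names are ours. The program text has a fixed size, and only the argument n changes with the number of digits requested.

```python
from itertools import islice

def pi_digits():
    """Yield the decimal digits of pi one at a time (3, 1, 4, ...)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is certain: emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough information yet: absorb one more term of the series.
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

def first_pi_digits(n):
    """Return the first n decimal digits of pi as a string."""
    return "".join(str(d) for d in islice(pi_digits(), n))

print(first_pi_digits(30))
```

Calling `first_pi_digits(10 ** 6)` would use exactly the same program text; only the number passed in, which takes about $\log_2 n$ bits to write down, is different.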

3.2.1  Kolmogorov  complexity  and  ideal  MDL 

To formalize our ideas, we need to decide on a description method, that is, a
formal language in which to express properties of the data. The most general choice is a
general-purpose* computer language such as C or Pascal. This choice leads to the
definition of the Kolmogorov complexity [71] of a sequence as the length of the shortest
program that prints the sequence and then halts. The lower the Kolmogorov complexity
of a sequence, the more regular it is. This notion seems to be highly dependent on the
particular computer language used. However, it turns out that for every two
general-purpose programming languages A and B and every data sequence D, the length of the
shortest program for D written in language A and the length of the shortest program for
D written in language B differ by no more than a constant c, which does not depend on
the length of D. This so-called invariance theorem says that, as long as the sequence D
is long enough, it is not essential which computer language one chooses, as long as it is
general-purpose. Kolmogorov complexity was introduced, and the invariance theorem
was proved, independently in [65], [16] and [106]. Solomonoff’s paper, called ‘A formal
theory of inductive inference’, contained the idea that the ultimate model for a sequence
of data may be identified with the shortest program that prints the data. Solomonoff’s
ideas were later extended by several authors, leading to an ‘idealized’ version of MDL
([107], [71], [37]). This idealized MDL is very general in scope, but not practically
applicable, for the following two reasons:

1. Uncomputability. It can be shown† that there exists no computer program that, for
every set of data D, when given D as input, returns the shortest program that
prints D [71].

2. Arbitrariness/dependence on syntax. In practice we are confronted with small data
samples for which the invariance theorem does not say much. Then the
hypothesis chosen by idealized MDL may depend on arbitrary details of the syntax
of the programming language under consideration.

* By this we mean that a universal Turing machine can be implemented in it [71].
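
The invariance theorem behind the second problem can be written compactly. Writing $K_A(D)$ for the length of the shortest program in language A that prints D and then halts (this subscripted notation is introduced here only for illustration), the theorem says that for every two general-purpose languages A and B there is a constant $c_{A,B}$, depending on A and B but not on D, such that

$$\bigl| K_A(D) - K_B(D) \bigr| \leq c_{A,B} \quad \text{for all data sequences } D.$$

The bound says nothing about how large $c_{A,B}$ is. If, hypothetically, it were a few hundred bits, it would be negligible for a sequence of a million bits but could exceed the entire description length of a sample of fifty bits, which is exactly the arbitrariness described above.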


3.2.2  Practical  MDL 

Like  most  authors  in  the  field,  we  concentrate  here  on  non-idealized,  practical  versions 
of  MDL  that  explicitly  deal  with  the  two  problems  mentioned  above.  The  basic  idea 
is  to  scale  down  Solomonoff’s  approach  so  that  it  does  become  applicable.  This  is 
achieved  by  using  description  methods  that  are  less  expressive  than  general-purpose 
computer  languages.  Such  description  methods  C  should  be  restrictive  enough  so  that 
for  any  data  sequence  D,  we  can  always  compute  the  length  of  the  shortest  description 
of D that is attainable using method C; but they should be general enough to allow us
to  compress  many  of  the  intuitively  ‘regular’  sequences.  The  price  we  pay  is  that,  using 
the  ‘practical’  MDL  principle,  there  will  always  be  some  regular  sequences  which  we 
will  not  be  able  to  compress.  But  we  already  know  that  there  can  be  no  method  for 
inductive  inference  at  all  which  will  always  give  us  all  the  regularity  there  is  -  simply 
because  there  can  be  no  automated  method  which  for  any  sequence  D  finds  the  shortest 
computer  program  that  prints  D  and  then  halts.  Moreover,  it  will  often  be  possible  to 
guide  a  suitable  choice  of  C  by  a  priori  knowledge  we  have  about  our  problem  domain. 
For  example,  below  we  consider  a  description  method  C  that  is  based  on  the  class  of 
all  polynomials,  such  that  with  the  help  of  C  we  can  compress  all  data  sets  which  can 
meaningfully  be  seen  as  points  on  some  polynomial. 


3.3  MDL  and  model  selection 

Let  us  recapitulate  our  main  insights  so  far: 

MDL:  The  basic  idea 

The goal of statistical inference may be cast as trying to find regularity in the data.
‘Regularity’ may be identified with ‘ability to compress’. MDL combines these two
insights by viewing learning as data compression: it tells us that, for a given set of
hypotheses $\mathcal{H}$ and data set D, we should try to find the hypothesis or combination of
hypotheses in $\mathcal{H}$ that compresses D most.


† This follows from Gödel’s incompleteness theorem; for a popular discussion see [56].

This idea can be applied to all sorts of inductive inference problems, but it turns out
to be most fruitful in (and its development has mostly concentrated on) problems of
model selection and, more generally, dealing with overfitting. Here is a standard
example (we explain the difference between ‘model’ and ‘hypothesis’ after the example).

Example  3.2  (Model  selection  and  overfitting) 

Consider the points in figure 3.1. We would like to learn how the y-values depend on
the x-values. To this end, we may want to fit a polynomial to the points.
Straightforward linear regression will give us the leftmost polynomial - a straight line that seems
overly simple: it does not capture the regularities in the data well. Since for any set of
n points there exists a polynomial of the $(n-1)$-st degree that goes exactly through
all these points, simply looking for the polynomial with the least error will give us a
polynomial like the one in the second picture. This polynomial seems overly complex:
it reflects the random fluctuations in the data rather than the general pattern underlying
it. Instead of picking the overly simple or the overly complex polynomial, it seems
more reasonable to prefer a relatively simple polynomial with small but nonzero error,
as in the rightmost picture.


Figure 3.1: A simple, a complex and a trade-off (3rd degree) polynomial.


This intuition is confirmed by numerous experiments on real-world data from a broad
variety of sources [93], [115], [87]: if one naively fits a high-degree polynomial to
a small sample (set of data points), then one obtains a very good fit to the data. Yet
if one tests the inferred polynomial on a second set of data coming from the same
source, it typically fits this test data very badly in the sense that there is a large distance
between the polynomial and the new data points. We say that the polynomial overfits
the data. Indeed, all model selection methods that are used in practice either implicitly
or explicitly choose a trade-off between goodness-of-fit and complexity of the models
involved. In practice, such trade-offs lead to much better predictions of test data than
one would get by adopting the ‘simplest’ (first-degree) or most ‘complex’* ($(n-1)$-st
degree) polynomial. MDL provides a means of achieving such a trade-off. ◊
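
The effect is easy to reproduce on synthetic data. The sketch below is an illustration under our own assumptions, not a reconstruction of figure 3.1: the underlying cubic, the ten equally spaced x-values and the hand-picked perturbations standing in for noise are all made up, and the least-squares routine is a plain normal-equations solver that is only suitable for low degrees.

```python
def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (low degrees only)."""
    m = degree + 1
    a = [[sum(x ** (j + k) for x in xs) for k in range(m)] for j in range(m)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(m)]
    for col in range(m):  # Gaussian elimination with partial pivoting
        piv = max(range(col, m), key=lambda row: abs(a[row][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, m):
            f = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= f * a[col][k]
            b[row] -= f * b[col]
    coef = [0.0] * m
    for j in reversed(range(m)):  # back substitution
        s = sum(a[j][k] * coef[k] for k in range(j + 1, m))
        coef[j] = (b[j] - s) / a[j][j]
    return lambda x: sum(c * x ** k for k, c in enumerate(coef))

def interpolate(xs, ys):
    """Degree n-1 Lagrange interpolant passing exactly through all n points."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            w = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    w *= (x - xj) / (xi - xj)
            total += yi * w
        return total
    return p

def mse(f, xs, ys):
    """Mean squared distance between f and the data points."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def truth(x):
    return 4 * x ** 3 - 3 * x  # assumed underlying regularity

train_x = [-1 + 2 * i / 9 for i in range(10)]
test_x = [-1 + (2 * i + 1) / 9 for i in range(9)]  # midpoints between training points
train_noise = [0.2, -0.3, 0.25, -0.1, 0.3, -0.2, 0.1, 0.3, -0.25, 0.15]  # fixed, made up
test_noise = [-0.1, 0.2, -0.25, 0.3, 0.1, -0.3, 0.2, -0.15, 0.25]        # fixed, made up
train_y = [truth(x) + e for x, e in zip(train_x, train_noise)]
test_y = [truth(x) + e for x, e in zip(test_x, test_noise)]

fits = {"degree 1": fit_poly(train_x, train_y, 1),
        "degree 3": fit_poly(train_x, train_y, 3),
        "degree 9": interpolate(train_x, train_y)}
errors = {name: (mse(f, train_x, train_y), mse(f, test_x, test_y))
          for name, f in fits.items()}
for name, (train_err, test_err) in errors.items():
    print(f"{name}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```

On this data the degree-9 polynomial has zero error on the training points but a larger error on the test points than the cubic, which is the overfitting pattern described above.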


It  will  be  useful  to  make  a  precise  distinction  between  ‘model’  and  ‘hypothesis’: 


* Strictly speaking, in our context it is not very accurate to speak of ‘simple’ or ‘complex’
polynomials; instead we should call the set of first degree polynomials ‘simple’, and the set of
100-th degree polynomials ‘complex’.

Models vs. Hypotheses

We use the phrase point hypothesis to refer to a single probability distribution or
function. An example is the polynomial $5x^2 + 4x + 3$. A point hypothesis is also known
as a ‘simple hypothesis’ in the statistical literature. We use the word model to refer to
a family (set) of probability distributions or functions with the same functional form.
An example is the set of all second-degree polynomials. A model is also known as
a ‘composite hypothesis’ in the statistical literature. We use hypothesis as a generic
term, referring to both point hypotheses and models.

In  our  terminology,  the  problem  described  in  example  3.2  is  a  ‘hypothesis  selection 
problem’  if  we  are  interested  in  selecting  both  the  degree  of  a  polynomial  and  the 
corresponding  parameters;  it  is  a  ‘model  selection  problem’  if  we  are  mainly  interested 
in  selecting  the  degree. 

To apply MDL to polynomial or other types of hypothesis and model selection, we
have to make precise the somewhat vague insight ‘learning may be viewed as data
compression’. This can be done in various ways. In this section, we concentrate on the
earliest and simplest implementation of the idea. This is the so-called crude† two-part
code version of MDL:

Crude,  two-part  version  of  MDL  principle  (informally  stated) 

Let $\mathcal{H}^{(1)}, \mathcal{H}^{(2)}, \ldots$ be a list of candidate models (e.g., $\mathcal{H}^{(k)}$ is the set of k-th degree
polynomials), each containing a set of point hypotheses (e.g., individual polynomials).
The best point hypothesis $H \in \mathcal{H}^{(1)} \cup \mathcal{H}^{(2)} \cup \ldots$ to explain the data D is the one
which minimizes the sum L(H) + L(D|H), where

• L(H) is the length, in bits, of the description of the hypothesis, and

• L(D|H) is the length, in bits, of the description of the data when encoded with
the help of the hypothesis.

The  best  model  to  explain  D  is  the  smallest  model  containing  the  selected  H. 


Example  3.3  (Polynomials  continued) 

In our previous example, the candidate hypotheses were polynomials. We can describe
a polynomial by describing its coefficients in a certain precision (number of bits per
parameter). Thus, the higher the degree of a polynomial or the precision, the more bits
we need to describe it and the more ‘complex’ it becomes. A description of the data
‘with the help of’ a hypothesis means that the better the hypothesis fits the data, the
shorter the description will be. A hypothesis that fits the data well gives us a lot of
information about the data. Such information can always be used to compress the data
(section 4.1). Intuitively, this is because we only have to code the errors the hypothesis
makes on the data rather than the full data. In our polynomial example, the better a
polynomial H fits D, the fewer bits we need to encode the discrepancies between the
actual y-values $y_i$ and the predicted y-values $H(x_i)$. We can typically find a very
complex point hypothesis (large L(H)) with a very good fit (small L(D|H)). We can also
typically find a very simple point hypothesis (small L(H)) with a rather bad fit (large
L(D|H)). The sum of the two description lengths will be minimized at a hypothesis
that is quite (but not too) ‘simple’, with a good (but not perfect) fit. ◊

† The terminology ‘crude MDL’ is not standard. It is introduced here for pedagogical reasons,
to clarify the importance of having a single, unified principle for designing codes. It should be
noted that Rissanen’s and Barron’s early theoretical papers on MDL already contain such
principles, albeit in a slightly different form than in their recent papers. Early practical applications
[83], [43] often do use ad hoc two-part codes which really are ‘crude’ in the sense defined here.
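
The trade-off can be tried out with a deliberately crude code. The sketch below rests on several assumptions of our own: each coefficient is charged a flat 8 bits (the coefficients are not actually rounded to that precision), the noise is taken to be Gaussian with a known standard deviation of 0.25, the y-values are taken to be recorded to a resolution of 0.01, and the data are made up (a cubic plus fixed hand-picked perturbations). All names are hypothetical.

```python
import math

def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (low degrees only)."""
    m = degree + 1
    a = [[sum(x ** (j + k) for x in xs) for k in range(m)] for j in range(m)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(m)]
    for col in range(m):  # Gaussian elimination with partial pivoting
        piv = max(range(col, m), key=lambda row: abs(a[row][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, m):
            f = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= f * a[col][k]
            b[row] -= f * b[col]
    coef = [0.0] * m
    for j in reversed(range(m)):  # back substitution
        s = sum(a[j][k] * coef[k] for k in range(j + 1, m))
        coef[j] = (b[j] - s) / a[j][j]
    return lambda x: sum(c * x ** k for k, c in enumerate(coef))

SIGMA = 0.25        # assumed known noise standard deviation
BITS_PER_COEF = 8   # assumed precision charged per coefficient
PRECISION = 0.01    # assumed resolution at which y-values are recorded

def hypothesis_bits(degree):
    """L(H): a flat charge per coefficient of the polynomial."""
    return (degree + 1) * BITS_PER_COEF

def data_bits(residual_sum_sq, n):
    """L(D|H): -log2 of the Gaussian probability of the discretized y-values."""
    density_part = (0.5 * n * math.log2(2 * math.pi * SIGMA ** 2)
                    + residual_sum_sq / (2 * SIGMA ** 2 * math.log(2)))
    return density_part - n * math.log2(PRECISION)

xs = [-1 + 2 * i / 9 for i in range(10)]
noise = [0.2, -0.3, 0.25, -0.1, 0.3, -0.2, 0.1, 0.3, -0.25, 0.15]  # fixed, made up
ys = [4 * x ** 3 - 3 * x + e for x, e in zip(xs, noise)]

totals = {}
for degree in range(7):
    h = fit_poly(xs, ys, degree)
    rss = sum((h(x) - y) ** 2 for x, y in zip(xs, ys))
    totals[degree] = hypothesis_bits(degree) + data_bits(rss, len(xs))
    print(degree, round(totals[degree], 1))

best_degree = min(totals, key=totals.get)
print("selected degree:", best_degree)
```

With these particular settings the total is smallest at degree 3: lower degrees pay more for the misfit than they save on coefficients, and higher degrees pay 8 bits per extra coefficient for a smaller reduction in misfit. Other choices of the per-coefficient charge would move the minimum, which is the arbitrariness that the discussion of L(H) in section 3.4 addresses.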


3.4  Crude  and  refined  MDL 

Crude MDL picks the H minimizing the sum L(H) + L(D|H). To make this
procedure well-defined, we need to agree on precise definitions for the codes (description
methods) giving rise to lengths L(D|H) and L(H). We now discuss these codes in
more detail. We will see that the definition of L(H) is problematic, indicating that we
somehow need to ‘refine’ our crude MDL principle.

Definition of L(D|H). Consider a two-part code as described above, and assume for
the time being that all H under consideration define probability distributions. If H is a
polynomial, we can turn it into a distribution by making the additional assumption that
the Y-values are given by Y = H(X) + Z, where Z is a normally distributed noise
term.

For each H we need to define a code with length $L(\cdot \mid H)$ such that L(D|H) can be
interpreted as ‘the code length of D when encoded with the help of H’. It turns out
that for probabilistic hypotheses, there is only one reasonable choice for this code. It
is the so-called Shannon-Fano code, satisfying, for all data sequences D,
$L(D \mid H) = -\log P(D \mid H)$, where P(D|H) is the probability mass or density of D according to
H - such a code always exists (see section 4.1).
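
For the polynomial case this code length can be written out explicitly. The following is a worked sketch under assumptions that go slightly beyond the text: the noise term Z has mean zero and a fixed, known variance $\sigma^2$, the x-values are treated as given, the data are $D = ((x_1, y_1), \ldots, (x_n, y_n))$, and code lengths are measured in bits. Using the Gaussian density for each $y_i$ gives

$$L(D \mid H) = -\log_2 P(D \mid H) = \frac{n}{2} \log_2 (2 \pi \sigma^2) + \frac{1}{2 \sigma^2 \ln 2} \sum_{i=1}^{n} \bigl( y_i - H(x_i) \bigr)^2 .$$

The first term does not depend on H, so under these assumptions a shorter code length for the data given H means exactly a smaller sum of squared errors: ‘encoding the data with the help of the hypothesis’ reduces to encoding its residuals.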

Definition of L(H): a problem for crude MDL It is more problematic to find a good
code for hypotheses H. Some authors have simply used ‘intuitively reasonable’ codes
in the past, but this is not satisfactory: since the description length L(H) of any fixed
point hypothesis H can be very large under one code, but quite short under another,
our procedure is in danger of becoming arbitrary. Instead, we need some additional
principle for designing a code for ℋ. In the first publications on MDL [88], [89], it
was advocated to choose some sort of minimax code for ℋ, minimizing, in some precisely
defined sense, the shortest worst-case total description length L(H) + L(D|H),
where the worst-case is over all possible data sequences. Thus, the MDL principle
is employed at a ‘meta-level’ to choose a code for H. However, this code requires
a cumbersome discretization of the model space ℋ, which is not always feasible in
practice. Alternatively, Barron [6] encoded H by the shortest computer program that,
when input D, computes P(D|H). While it can be shown that this leads to similar
code lengths, it is computationally problematic. Later, Rissanen [90] realized that
these problems could be side-stepped by using a one-part rather than a two-part code.
This development culminated in 1996 in a completely precise prescription of MDL
for many, but certainly not all practical situations [95]. We call this modern version of
MDL ‘refined MDL’.
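
The danger of arbitrariness can be illustrated with a minimal sketch. It simplifies a hypothesis to its polynomial degree alone and compares two valid prefix codes for that degree; both codes and the values in `fit_bits` are hypothetical numbers invented for the illustration, not results from the text.

```python
import math

def uniform_code_length(k, max_degree=15):
    """Fixed-length code: every degree in 0..max_degree costs the same."""
    return math.log2(max_degree + 1)

def unary_code_length(k):
    """Unary code: degree k is written as k ones followed by a zero."""
    return k + 1

# Hypothetical values of L(D|H) in bits for the best polynomial of each degree.
fit_bits = {1: 120.0, 3: 112.0, 10: 107.0}

def crude_mdl_choice(hypothesis_bits):
    """Return the degree minimizing L(H) + L(D|H), together with all totals."""
    totals = {k: hypothesis_bits(k) + fit_bits[k] for k in fit_bits}
    return min(totals, key=totals.get), totals

choice_uniform, totals_uniform = crude_mdl_choice(uniform_code_length)
choice_unary, totals_unary = crude_mdl_choice(unary_code_length)
print(choice_uniform, choice_unary)
```

With the same fit terms, the two codes select different degrees, so without a principle for choosing the code for hypotheses the outcome depends on an arbitrary design decision.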

Refined MDL In refined MDL, we associate a code for encoding D not with a single
H ∈ ℋ, but with the full model ℋ. Thus, given model ℋ, we encode data not in two
parts but we design a single one-part code with lengths L̄(D|ℋ). This code is designed
such that whenever there is a member of (parameter in) ℋ that fits the data well, in the
sense that L(D|H) is small, then the code length L̄(D|ℋ) will also be small. Codes
with this property are called universal codes in the information-theoretic literature [9].
Among all such universal codes, we pick the one that is minimax optimal in a sense
made precise in section 4.4. For example, the set ℋ^(3) of third-degree polynomials is
associated with a code with lengths L̄(·|ℋ^(3)) such that, the better the data D are fit by
the best-fitting third-degree polynomial, the shorter the code length L̄(D|ℋ). L̄(D|ℋ)
is called the stochastic complexity of the data given the model.

Parametric complexity The second fundamental concept of refined MDL is the parametric
complexity of a parametric model ℋ, which we denote by COMP(ℋ). This is
a measure of the ‘richness’ of model ℋ, indicating its ability to fit random data. This
complexity is related to the degrees of freedom in ℋ, but also to the geometrical structure
of ℋ; see example 3.4. To see how it relates to stochastic complexity, let, for given
data D, Ĥ denote the distribution in ℋ which maximizes the probability of D,
and hence minimizes the code length L(D|Ĥ) of D. It turns out that

stochastic complexity of D given ℋ = L(D|Ĥ) + COMP(ℋ).
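
As a hedged illustration of this decomposition, the sketch below computes both terms for the Bernoulli model on binary sequences. It assumes that the parametric complexity is the logarithm of the summed maximized likelihoods over all sequences of length n, as in the normalized maximum likelihood construction that the text attributes to Shtarkov [104] and defers to section 4.4; the function names are illustrative.

```python
import math

def max_likelihood(k, n):
    """Probability of a sequence with k ones under the best-fitting Bernoulli parameter k/n."""
    p = k / n
    return (p ** k) * ((1 - p) ** (n - k))

def parametric_complexity_bits(n):
    """log2 of the sum, over all 2^n sequences, of their maximized likelihood."""
    total = sum(math.comb(n, k) * max_likelihood(k, n) for k in range(n + 1))
    return math.log2(total)

def stochastic_complexity_bits(data):
    """Goodness-of-fit term plus complexity term for the Bernoulli model."""
    n, k = len(data), sum(data)
    fit = -math.log2(max_likelihood(k, n))
    return fit + parametric_complexity_bits(n)

print(parametric_complexity_bits(10), parametric_complexity_bits(100))
print(stochastic_complexity_bits([0, 0, 1, 0, 0, 0, 0, 1, 0, 0]))
```

Under this assumption the complexity term depends only on the model and the sample size, not on the observed data, while the fit term depends on how well the best member of the model describes the data.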

Refined MDL model selection between two parametric models (such as the models of
first and second degree polynomials) now proceeds by selecting the model such that
the stochastic complexity of the given data D is smallest. Although we used a one-part
code to encode data, refined MDL model selection still involves a trade-off between
two terms: a goodness-of-fit term L(D|Ĥ) and a complexity term COMP(ℋ).
However, because we do not explicitly encode hypotheses H any more, there is no
arbitrariness any more. The resulting procedure can be interpreted in several different
ways, some of which provide us with rationales for MDL beyond the pure coding
interpretation (sections 4.5.1 - 4.5.4):

1. Counting interpretation The parametric complexity of a model is the logarithm
of the number of essentially different, distinguishable point hypotheses
within the model.

2. Two-part code interpretation For large samples, the stochastic complexity
can be interpreted as a two-part code length of the data after all, where hypotheses
H are encoded with a special code that works by first discretizing the
model space ℋ into a set of ‘maximally distinguishable hypotheses’, and then
assigning equal code length to each of these.

3. Bayesian interpretation In many cases, refined MDL model selection coincides
with Bayes factor model selection based on a non-informative prior such
as Jeffreys’ prior [10].

4.  Prequential  interpretation  Refined  MDL  model  selection  can  be  interpreted 
as  selecting  the  model  with  the  best  predictive  performance  when  sequentially 
predicting  -  prequential  -  unseen  test  data,  in  the  sense  described  in  section 
4.5.4.  This  makes  it  an  instance  of  Dawid’s  [22]  prequential  model  validation 
and  also  relates  it  to  cross-validation  methods. 
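
The prequential interpretation can be sketched for binary data as follows. Each outcome is predicted from the outcomes before it, and the code length is the accumulated logarithmic loss. The predictor used here, Laplace's rule of succession, is an assumed choice for the illustration and is not prescribed by the text.

```python
import math

def prequential_code_length_bits(data):
    """Sum of -log2 of the probability assigned to each outcome before seeing it."""
    ones = zeros = 0
    total = 0.0
    for x in data:
        # Laplace's rule of succession: an assumed choice of sequential predictor.
        p_one = (ones + 1) / (ones + zeros + 2)
        total += -math.log2(p_one if x == 1 else 1 - p_one)
        ones += x
        zeros += 1 - x
    return total

data = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
print(prequential_code_length_bits(data))
```

For this particular predictor the total depends only on the number of ones and zeros, not on their order, so the accumulated prediction loss behaves like a single code length for the whole sequence.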

Refined MDL allows us to compare models of different functional form. It even accounts
for the phenomenon that different models with the same number of parameters
may not be equally ‘complex’:

Example  3.4 

Consider two models from psychophysics describing the relationship between physical
quantities (e.g., light intensity) and their psychological counterparts (e.g., brightness)
[80]: y = a x^b + Z (Stevens’ model) and y = a ln(x + b) + Z (Fechner’s model),
where Z is a normally distributed noise term. Both models have two free parameters;
nevertheless, it turns out that in a sense, Stevens’ model is more flexible or complex
than Fechner’s. Roughly speaking, this means there are a lot more data patterns that
can be explained by Stevens’ model than can be explained by Fechner’s model. In [80],
many samples of size 4 were generated from Fechner’s model, using some fixed
parameter values, and both models were then fitted to each sample. In 67% of the trials,
Stevens’ model fitted the data better than Fechner’s, even though the latter generated
the data. Indeed, in refined MDL, the ‘complexity’ associated with Stevens’ model is
much larger than the complexity associated with Fechner’s. If both models fit the data
equally well, MDL will prefer Fechner’s model.
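
An experiment of this kind can be sketched as below. This is not a reproduction of the study in [80]: the design points `X`, the generating parameters, the noise level and the grid search over b are all assumptions made for the illustration, so the fraction it prints need not match the 67% reported there.

```python
import math
import random

X = [1.0, 2.0, 3.0, 4.0]                  # assumed design points (sample size 4)
B_GRID = [i / 50 for i in range(1, 151)]  # assumed search grid for b: 0.02 .. 3.00

def best_sse(xs, ys, feature):
    """Smallest squared error of y = a * feature(x, b), searching b over the grid."""
    best = float("inf")
    for b in B_GRID:
        f = [feature(x, b) for x in xs]
        # For fixed b the optimal a follows from linear least squares.
        a = sum(y * v for y, v in zip(ys, f)) / sum(v * v for v in f)
        best = min(best, sum((y - a * v) ** 2 for y, v in zip(ys, f)))
    return best

def stevens(x, b):
    return x ** b

def fechner(x, b):
    return math.log(x + b)

def fraction_stevens_wins(trials=500, a=2.0, b=1.0, sigma=0.5, seed=0):
    """Generate data from Fechner's model and count how often Stevens' model fits better."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        ys = [a * math.log(x + b) + rng.gauss(0.0, sigma) for x in X]
        if best_sse(X, ys, stevens) < best_sse(X, ys, fechner):
            wins += 1
    return wins / trials

print(fraction_stevens_wins())
```

Comparing raw fit in this way ignores model complexity, which is exactly the effect that the complexity term of refined MDL is meant to correct.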

Summary  Refined  MDL  removes  the  arbitrary  aspect  of  crude,  two-part  code  MDL 
and  associates  parametric  models  with  an  inherent  ‘complexity’  that  does  not  depend 
on  any  particular  description  method  for  hypotheses.  We  should,  however,  warn  the 
reader  that  we  only  discussed  a  special,  simple  situation  in  which  we  compared  a  finite 
number  of  parametric  models  that  satisfy  certain  regularity  conditions.  Whenever  the 
models  do  not  satisfy  these  conditions,  or  if  we  compare  an  infinite  number  of  models, 
then  the  refined  ideas  have  to  be  extended.  We  then  obtain  a  ‘general’  refined  MDL 
principle,  which  employs  a  combination  of  one-part  and  two-part  codes. 


3.5  The  MDL  philosophy 

The  first  central  MDL  idea  is  that  every  regularity  in  data  may  be  used  to  compress  that 
data;  the  second  central  idea  is  that  learning  can  be  equated  with  finding  regularities  in 
data. Whereas the first part is relatively straightforward, the second part of the idea implies
that methods for learning from data must have a clear interpretation independent
of  whether  any  of  the  models  under  consideration  is  ‘true’  or  not.  Quoting  J.  Rissanen 
[93],  the  main  originator  of  MDL: 

“We never want to make the false assumption that the observed data
actually were generated by a distribution of some kind, say Gaussian,
and then go on to analyze the consequences and make further deductions.
Our deductions may be entertaining but quite irrelevant to the
task at hand, namely, to learn useful properties from the data.”

Based on such ideas, Rissanen developed a radical philosophy of learning and statistical
inference that is considerably different from the ideas underlying mainstream
statistics, both frequentist and Bayesian. We now describe this philosophy in more
detail:

1. Regularity as compression According to Rissanen, the goal of inductive inference
should be to ‘squeeze out as much regularity as possible’ from the given data. The
main task for statistical inference is to distill the meaningful information present in the
data, i.e., to separate structure (interpreted as the regularity, the ‘meaningful information’)
from noise (interpreted as the ‘accidental information’). For the three sequences
of example 3.2, this would amount to the following: the first sequence would be considered
as entirely regular and ‘noiseless’. The second sequence would be considered
as entirely random - all information in the sequence is accidental, there is no structure
present. In the third sequence, the structural part would (roughly) be the pattern
that four times as many 0s as 1s occur; given this regularity, the description of exactly
which of all sequences with four times as many 0s as 1s occurs is the accidental
information.
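
For the third sequence, the split into structure and accidental information can be put into numbers with a small sketch. The sequence length `n` is an assumption made for the illustration, as is the simple way of encoding the structural part (the number of 1s is sent with a uniform code over 0..n).

```python
import math

def two_part_length_bits(n, ones):
    """Bits for the structure (the count of 1s) and for the accidental part."""
    structure = math.log2(n + 1)                # which count of 1s, out of 0..n
    accidental = math.log2(math.comb(n, ones))  # which sequence with that count
    return structure, accidental

n = 10000      # assumed sequence length
ones = n // 5  # four times as many 0s as 1s

structure_bits, accidental_bits = two_part_length_bits(n, ones)
print(structure_bits, accidental_bits, n)
```

The regularity costs only a handful of bits to state, and once it is known the remaining accidental information takes clearly fewer than the n bits needed to write the sequence out literally.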

2. Models as languages Rissanen interprets models (sets of hypotheses) as nothing
more than languages for describing useful properties of the data - a model ℋ is identified
with its corresponding universal code L̄(·|ℋ). Different individual hypotheses
within the models express different regularities in the data, and may simply be regarded
as statistics, that is, summaries of certain regularities in the data. These regularities are
present and meaningful independently of whether some H* ∈ ℋ is the ‘true state of
nature’ or not. Suppose that the model ℋ under consideration is probabilistic. In traditional
theories, one typically assumes that some P* ∈ ℋ generates the data, and then
‘noise’ is defined as a random quantity relative to this P*. In the MDL view ‘noise’
is defined relative to the model ℋ as the residual number of bits needed to encode the
data once the model ℋ is given. Thus, noise is not a random variable: it is a function
only of the chosen model and the actually observed data. Indeed, there is no place
for a ‘true distribution’ or a ‘true state of nature’ in this view - there are only models
and data. To bring out the difference to the ordinary statistical viewpoint, consider the
phrase ‘these experimental data are quite noisy’. According to a traditional interpretation,
such a statement means that the data were generated by a distribution with high
variance. According to the MDL philosophy, such a phrase means only that the data
are not compressible with the currently hypothesized model - as a matter of principle,
it can never be ruled out that there exists a different model under which the data are
very compressible (not noisy) after all!
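
The point that ‘noise’ is relative to the model can be illustrated with a minimal sketch. It encodes one alternating binary sequence with the best-fitting member of two models, a Bernoulli model and a first-order Markov chain. As a simplification, the cost of describing the fitted parameters is ignored and the first symbol of the Markov code is charged one bit; both choices are assumptions of the example.

```python
import math
from collections import Counter

def bernoulli_bits(seq):
    """-log2 of the maximized Bernoulli likelihood of the sequence."""
    n = len(seq)
    counts = Counter(seq)
    return -sum(c * math.log2(c / n) for c in counts.values())

def markov_bits(seq):
    """-log2 of the maximized first-order Markov likelihood; first symbol costs 1 bit."""
    transitions = Counter(zip(seq, seq[1:]))
    from_counts = Counter(seq[:-1])
    bits = 1.0
    for (prev, _next), c in transitions.items():
        bits -= c * math.log2(c / from_counts[prev])
    return bits

seq = [0, 1] * 50  # 0101...01, length 100

print(bernoulli_bits(seq), markov_bits(seq))
```

Relative to the Bernoulli model the sequence is as ‘noisy’ as a sequence can be, while relative to the Markov model almost no residual bits remain, although the data are the same in both cases.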

3. We have only the data Many (but not all*) other methods of inductive inference are
based on the idea that there exists some ‘true state of nature’, typically a distribution
assumed to lie in some model ℋ. The methods are then designed as a means to identify
or approximate this state of nature based on as little data as possible. According to
Rissanen, such methods are fundamentally flawed. The main reason is that the methods
are designed under the assumption that the true state of nature is in the assumed
model ℋ, which is often not the case. Therefore, such methods only admit a clear
interpretation under assumptions that are typically violated in practice. Many cherished
statistical methods are designed in this way - we mention hypothesis testing,
minimum-variance unbiased estimation, several non-parametric methods, and even some forms
of Bayesian inference - see example 4.20. In contrast, MDL has a clear interpretation
which depends only on the data, and not on the assumption of any underlying ‘state of
nature’.

Example  3.5  (Models  that  are  wrong,  yet  useful) 

Even though the models under consideration are often wrong, they can nevertheless be
very useful. Examples are the successful ‘Naive Bayes’ model for spam filtering, Hidden
Markov Models for speech recognition (is speech a stationary ergodic process?
probably not), and the use of linear models in econometrics and psychology. Since
these models are evidently wrong, it seems strange to base inferences on them using
methods that are designed under the assumption that they contain the true distribution.
To be fair, we should add that domains such as spam filtering and speech recognition
are not what the fathers of modern statistics had in mind when they designed their
procedures - they were usually thinking about much simpler domains, where the assumption
that some distribution P* ∈ ℋ is ‘true’ may not be so unreasonable.

* For example, cross-validation cannot easily be interpreted in such terms of ‘a method
hunting for the true distribution’.

4. MDL and consistency Let ℋ be a probabilistic model, such that each P ∈ ℋ is
a probability distribution. Roughly, a statistical procedure is called consistent relative
to ℋ if, for all P* ∈ ℋ, the following holds. Suppose data are distributed according
to P*. Then given enough data, the learning method will yield a good approximation
of P* with high probability. Many traditional statistical methods have been designed
with consistency in mind (section 4.2).

The  fact  that  in  MDL,  we  do  not  assume  a  true  distribution  may  suggest  that  we  do 
not  care  about  statistical  consistency.  But  this  is  not  the  case:  we  would  still  like  our 
statistical  method  to  be  such  that  in  the  idealized  case  where  one  of  the  distributions 
in  one  of  the  models  under  consideration  actually  generates  the  data,  our  method  is 
able  to  identify  this  distribution,  given  enough  data.  If  even  in  the  idealized  special 
case  where  a  ‘truth’  exists  within  our  models,  the  method  fails  to  learn  it,  then  we 
certainly  cannot  trust  it  to  do  something  reasonable  in  the  more  general  case,  where 
there  may  not  be  a  ‘true  distribution’  underlying  the  data  at  all.  Thus,  consistency  is 
important  in  the  MDL  philosophy.  However,  it  is  used  as  a  sanity  check  (for  a  method 
that  has  been  developed  without  making  distributional  assumptions)  rather  than  as  a 
design  principle. 

In  fact,  mere  consistency  is  not  sufficient.  We  would  like  our  method  to  converge  to 
the  imagined  true  P*  fast,  based  on  as  small  a  sample  as  possible.  Two-part  code 
MDL  with  ‘clever’  codes  achieves  good  rates  of  convergence  in  this  sense  (Barron  and 
Cover  [8],  complemented  by  [127],  show  that  in  many  situations,  the  rates  are  minimax 
optimal).  The  same  seems  to  be  true  for  refined  one-part  code  MDL  [9],  although  there 
is  at  least  one  surprising  exception  where  inference  based  on  the  normalized  maximum 
likelihood  and  Bayesian  universal  model  behaves  abnormally  -  see  [21]  for  the  details. 

Summary The MDL philosophy is quite agnostic about whether any of the models
under consideration is ‘true’, or whether something like a ‘true distribution’ even exists.
Nevertheless, it has been suggested [124], [27] that MDL embodies a naive belief
that ‘simple models’ are ‘a priori more likely to be true’ than complex models. Below
we explain why such claims are mistaken.


3.6  MDL  and  Occam’s  razor 

When two models fit the data equally well, MDL will choose the one that is the ‘simplest’
in the sense that it allows for a shorter description of the data. As such, it implements
a precise form of Occam’s razor - even though as more and more data becomes
available, the model selected by MDL may become more and more ‘complex’! Occam’s
razor is sometimes criticized for being either (1) arbitrary or (2) false [124],
[27]. Do these criticisms apply to MDL as well?

1. ‘Occam’s razor (and MDL) is arbitrary’ Because ‘description length’ is a syntactic
notion it may seem that MDL selects an arbitrary model: different codes would
have led to different description lengths, and therefore, to different models. By changing
the encoding method, we can make ‘complex’ things ‘simple’ and vice versa. This
overlooks the fact that we are not allowed to use just any code we like! ‘Refined’ MDL tells
us to use a specific code, independent of any specific parameterization of the model,
leading to a notion of complexity that can also be interpreted without any reference to
‘description lengths’ (see also section 4.9.1).

2.  ‘Occam’s  razor  is  false’  It  is  often  claimed  that  Occam’s  razor  is  false  -  we  often 
try  to  model  real-world  situations  that  are  arbitrarily  complex,  so  why  should  we  favor 
simple  models?  In  the  words  of  G.  Webb:  ‘What  good  are  simple  models  of  a  complex 
world?’ 

The  short  answer  is:  even  if  the  true  data  generating  machinery  is  very  complex,  it  may 
be  a  good  strategy  to  prefer  simple  models  for  small  sample  sizes.  Thus,  MDL  (and 
the  corresponding  form  of  Occam’s  razor)  is  a  strategy  for  inferring  models  from  data 
(‘choose  simple  models  at  small  sample  sizes’),  not  a  statement  about  how  the  world 
works  (‘simple  models  are  more  likely  to  be  true’)  -  indeed,  a  strategy  cannot  be  true  or 
false,  it  is  ‘clever’  or  ‘stupid’.  And  the  strategy  of  preferring  simpler  models  is  clever 
even  if  the  data  generating  process  is  highly  complex,  as  illustrated  by  the  following 
example: 

Example  3.6  (‘Infinitely’  complex  sources) 

Suppose that data are subject to the law Y = g(X) + Z where g is some continuous
function and Z is some noise term with mean 0. If g is not a polynomial, but X only
takes values in a finite interval, say [-1, 1], we may still approximate g arbitrarily well
by taking higher and higher degree polynomials. For example, let g(x) = exp(x).
Note that the exponential function is itself commonly computed using polynomial approximations.
Then, if we use MDL to learn a polynomial for data D = ((x_1, y_1), ..., (x_n, y_n)),
the degree of the polynomial f^(n) selected by MDL at sample size n will increase
with n, and with high probability, f^(n) converges to g(x) = exp(x) in the sense that
max_{x ∈ [-1,1]} |f^(n)(x) - g(x)| → 0. Of course, if we had better prior knowledge about
the problem we could have tried to learn g using a model class M containing the
function y = exp(x). But in general, both our imagination and our computational
resources are limited, and we may be forced to use imperfect models.
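
The behaviour described in this example can be explored with a rough sketch. It is not refined MDL: it uses a crude two-part criterion in which the noise standard deviation `SIGMA` is assumed known and each polynomial coefficient is charged (1/2) log2 n bits, a common approximation adopted here as an assumption. The fit term is the negative log2 Gaussian density, so only differences between degrees are meaningful.

```python
import math
import random

SIGMA = 0.1  # assumed known noise standard deviation

def fit_sse(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations; returns the squared error."""
    m = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    rows = [[sum(x ** (i + j) for x in xs) for j in range(m)]
            + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(m)]
    for col in range(m):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, m), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(m):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [u - factor * v for u, v in zip(rows[r], rows[col])]
    coef = [rows[i][m] / rows[i][i] for i in range(m)]
    return sum((y - sum(c * x ** k for k, c in enumerate(coef))) ** 2
               for x, y in zip(xs, ys))

def total_bits(xs, ys, degree):
    """Crude two-part length: Gaussian fit term plus an assumed parameter cost."""
    n = len(xs)
    fit = (n / 2 * math.log2(2 * math.pi * SIGMA ** 2)
           + fit_sse(xs, ys, degree) / (2 * SIGMA ** 2 * math.log(2)))
    penalty = (degree + 1) / 2 * math.log2(n)
    return fit + penalty

def selected_degree(n, max_degree=6, seed=0):
    """Degree minimizing the crude criterion on n noisy samples of exp(x) on [-1, 1]."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [math.exp(x) + rng.gauss(0, SIGMA) for x in xs]
    return min(range(max_degree + 1), key=lambda d: total_bits(xs, ys, d))

for n in (10, 100, 5000):
    print(n, selected_degree(n))
```

Under these assumptions one would expect low degrees to be chosen for small samples and somewhat higher degrees as n grows, because the fit term scales with n while the parameter cost grows only logarithmically.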

If, based on a small sample, we choose the best-fitting polynomial f within the set
of all polynomials, then, even though f will fit the data very well, it is likely to be
quite unrelated to the ‘true’ g, and f may lead to disastrous predictions of future data.
The reason is that, for small samples, the set of all polynomials is very large - in the
sense of the number of elements - compared to the set of possible data patterns that
we might have observed. Therefore, any particular data pattern can only give us very
limited information about which high-degree polynomial best approximates g. On the
other hand, if we choose the best-fitting f° in some much smaller set such as the set of
second-degree polynomials, then it is highly probable that the prediction quality (mean
squared error) of f° on future data is about the same as its mean squared error on the
data we observed: the size (complexity) of the contemplated model is relatively small
compared to the set of possible data patterns that we might have observed. Therefore,
the particular pattern that we do observe gives us a lot of information on what
second-degree polynomial best approximates g.

Thus, (a) f° typically leads to better predictions of future data than f; and (b) unlike
f, f° is reliable in that it gives a correct impression of how well it will predict future
data even if the ‘true’ g is ‘infinitely’ complex. This idea does not just appear in MDL,
but is also the basis of Vapnik’s [115] structural risk minimization approach and many
standard statistical methods for non-parametric inference. In such approaches one
acknowledges that the data generating machinery can be infinitely complex (e.g., not
describable by a finite degree polynomial). Nevertheless, it is still a good strategy to
approximate it by simple hypotheses (low-degree polynomials) as long as the sample
size is small. Summarizing:

The  inherent  difference  between  under-  and  overfitting 

If  we  choose  an  overly  simple  model  for  our  data,  then  the  best-fitting  point  hypothesis 
within  the  model  is  likely  to  be  almost  the  best  predictor,  within  the  simple  model,  of 
future  data  coming  from  the  same  source.  If  we  overfit  (choose  a  very  complex  model) 
and  there  is  noise  in  our  data,  then,  even  if  the  complex  model  contains  the  ‘true’  point 
hypothesis,  the  best-fitting  point  hypothesis  within  the  model  is  likely  to  lead  to  very 
bad  predictions  of  future  data  coming  from  the  same  source. 

This  statement  is  very  imprecise  and  is  meant  more  to  convey  the  general  idea  than 
to  be  completely  true.  As  will  become  clear  in  section  4.9.1,  it  becomes  provably  true 
if  we  use  MDL’s  measure  of  model  complexity;  we  measure  prediction  quality  by 
logarithmic  loss;  and  we  assume  that  one  of  the  distributions  in  H  actually  generates 
the  data. 


3.7  History 

The MDL principle has mainly been developed by J. Rissanen in a series of papers
starting with [88]. It has its roots in the theory of Kolmogorov or algorithmic complexity
[71], developed in the 1960s by Solomonoff [106], Kolmogorov [65] and Chaitin
[16]. Among these authors, Solomonoff (a former student of the famous philosopher
of science, Rudolf Carnap) was explicitly interested in inductive inference. His 1964
paper contains explicit suggestions on how the underlying ideas could be made practical,
thereby foreshadowing some of the later work on two-part MDL. While Rissanen
was not aware of Solomonoff’s work at the time, Kolmogorov’s [65] paper did serve
as an inspiration for Rissanen’s [88] development of MDL.

Another  important  inspiration  for  Rissanen  was  Akaike’s  [1]  AIC  method  for  model 
selection,  essentially  the  first  model  selection  method  based  on  information  theoretic 
ideas.  Even  though  Rissanen  was  inspired  by  AIC,  both  the  actual  method  and  the 
underlying  philosophy  are  substantially  different  from  MDL. 

MDL is much more closely related to the minimum message length principle, developed by
Wallace and his co-workers in a series of papers starting with the ground-breaking
[121]; other milestones are [122] and [123]. Remarkably, Wallace developed his ideas
without being aware of the notion of Kolmogorov complexity. Although Rissanen became
aware of Wallace’s work before the publication of [88], he developed his ideas
mostly independently, being influenced rather by Akaike and Kolmogorov. Indeed, despite
the close resemblance of both methods in practice, the underlying philosophy is
quite different - see section 4.8.

The  first  publications  on  MDL  only  mention  two-part  codes.  Important  progress  was 
made  by  Rissanen  [90],  in  which  prequential  codes  are  employed  for  the  first  time  and 
[92],  introducing  the  Bayesian  mixture  codes  into  MDL.  This  led  to  the  development  of 
the  notion  of  stochastic  complexity  as  the  shortest  code  length  of  the  data  given  a  model 
[91],  [92].  However,  the  connection  to  Shtarkov’s  normalized  maximum  likelihood 
code  [104]  was  not  made  until  1996,  and  this  prevented  the  full  development  of  the 
notion of ‘parametric complexity’. In the meantime, in his impressive Ph.D. thesis,
Barron  [6]  showed  how  a  specific  version  of  the  two-part  code  criterion  has  excellent 
frequentist  statistical  consistency  properties.  This  was  extended  by  Barron  and  Cover 
[8]  who  achieved  a  breakthrough  for  two-part  codes:  they  gave  clear  prescriptions  on 
how  to  design  codes  for  hypotheses,  relating  codes  with  good  minimax  code  length 
properties  to  rates  of  convergence  in  statistical  consistency  theorems.  Some  of  the  ideas 
of  Rissanen  [92]  and  Barron  and  Cover  [8]  were,  as  it  were,  unified  when  Rissanen 
[95]  introduced  a  new  definition  of  stochastic  complexity  based  on  the  normalized 
maximum  likelihood  code  (section  4.4).  The  resulting  theory  was  summarized  for  the 
first  time  by  Barron,  Rissanen,  and  Yu  [9],  and  is  called  ‘refined  MDL’  in  the  present 
overview. 


3.8  Summary  and  outlook 

We  discussed  how  regularity  is  related  to  data  compression,  and  how  MDL  employs 
this  connection  by  viewing  learning  in  terms  of  data  compression.  One  can  make  this 
precise  in  several  ways;  in  idealized  MDL  one  looks  for  the  shortest  program  that 
generates  the  given  data.  This  approach  is  not  feasible  in  practice,  and  here  we  concern 
ourselves with practical MDL. Practical MDL comes in a crude version based on two-part
codes and in a modern, more refined version based on the concept of universal
coding.  The  basic  ideas  underlying  all  these  approaches  can  be  found  in  the  boxes 
spread  throughout  the  text. 

These  methods  are  mostly  applied  to  model  selection  but  can  also  be  used  for  other 
problems  of  inductive  inference.  In  contrast  to  most  existing  statistical  methodology, 
they  can  be  given  a  clear  interpretation  irrespective  of  whether  or  not  there  exists  some 
‘true’  distribution  generating  data  -  inductive  inference  is  seen  as  a  search  for  regular 
properties  in  (interesting  statistics  of)  the  data,  and  there  is  no  need  to  assume  anything 
outside  the  model  and  the  data.  In  contrast  to  what  is  sometimes  thought,  there  is  no 
implicit  belief  that  ‘simpler  models  are  more  likely  to  be  true’  -  MDL  does  embody  a 
preference  for  ‘simple’  models,  but  this  is  best  seen  as  a  strategy  for  inference  that  can 
be  useful  even  if  the  environment  is  not  simple  at  all. 

In  the  next  chapter,  we  formally  introduce  both  the  crude  and  the  refined  versions  of 
practical  MDL.  For  this,  it  is  absolutely  essential  that  the  reader  familiarizes  him-  or 
herself  with  two  basic  notions  of  coding  and  information  theory:  the  relation  between 
code  length  functions  and  probability  distributions,  and  (for  refined  MDL),  the  idea  of 
universal coding - a large part of the chapter will be devoted to these.


4. Minimum description length


In  chapter  3,  we  introduced  the  MDL  principle  in  an  informal  way.  In  this  chapter,  we 
give  an  introduction  to  MDL  that  is  mathematically  precise.  Throughout  the  text,  we 
assume  some  basic  familiarity  with  probability  theory.  While  some  prior  exposure  to 
basic  statistics  is  highly  useful,  it  is  not  required.  The  chapter  can  be  read  without  any 
prior  knowledge  of  information  theory,  and  is  organized  as  follows: 

•  The  first  two  sections  are  of  a  preliminary  nature: 

- Any understanding of MDL requires some minimal knowledge of information
theory - in particular the relationship between probability distributions
and codes. This relationship is explained in section 4.1.

-  Relevant  statistical  notions  such  as  ‘maximum  likelihood  estimation’  are 
reviewed  in  section  4.2.  There  we  also  introduce  the  Markov  chain  model 
which  will  serve  as  an  example  model  throughout  the  text. 

• Based on this preliminary material, we formalize a simple version of the MDL
principle in section 4.3. In this text it is called the crude two-part MDL principle.
We explain why, for successful practical applications, crude MDL needs
to be refined.

• Section 4.4 is once again preliminary: it discusses universal coding, the
information-theoretic concept underlying refined versions of MDL.

•  Sections  4.5  -  4.7  define  and  discuss  refined  MDL.  They  are  the  key  sections 
of  the  chapter: 

- Section 4.5 discusses basic refined MDL for comparing a finite number
of simple statistical models and introduces the central concepts of parametric
and stochastic complexity. It gives an asymptotic expansion of
these quantities and interprets them from a compression, a geometric, a
Bayesian and a predictive point of view.

-  Section  4.6  extends  refined  MDL  to  harder  model  selection  problems, 
and  in  doing  so  reveals  the  general,  unifying  idea. 

-  Section  4.7  briefly  discusses  how  to  extend  MDL  to  applications  beyond 
model  selection. 

•  The  next  two  sections  place  ‘refined  MDL’  in  its  context: 

-  Section  4.8  compares  MDL  to  other  approaches  to  inductive  inference, 
most  notably  the  related  but  different  Bayesian  approach. 

-  Section  4.9  discusses  perceived  as  well  as  real  problems  with  MDL.  The 
perceived  problems  relate  to  MDL’s  relation  to  Occam’s  razor,  the  real 
problems  relate  to  the  fact  that  applications  of  MDL  sometimes  perform 
suboptimally  in  practice. 

•  Finally,  section  4.10  provides  a  discussion. 

Throughout the text, paragraph headings reflect the most important concepts. Boxes
summarize the most important findings. Together, paragraph headings and boxes provide
an overview of MDL theory.

4.1 Information theory I: probabilities and code lengths

This first section is a mini-primer on information theory, focusing on the relationship
between probability distributions and codes. A good understanding of this relationship
is essential for a good understanding of MDL. After some preliminaries, section 4.1.1
introduces prefix codes, the type of codes we work with in MDL. These are related to
probability distributions in two ways. In section 4.1.2 we discuss the first relationship,
which is related to the Kraft inequality: for every probability mass function $P$, there
exists a code with lengths $\lceil -\log P \rceil$, and vice versa. The symbol $\lceil \cdot \rceil$ is defined below.
Section 4.1.3 discusses the second relationship, related to the information inequality,
which says that if the data are distributed according to $P$, then the code with lengths
$\lceil -\log P \rceil$ achieves the minimum expected code length. Throughout the section we
give examples relating our findings to our discussion of regularity and compression in
section 3.2 of chapter 3.

Preliminaries and notational conventions - codes We use $\log$ to denote the logarithm
to base 2. For real-valued $x$ we use $\lceil x \rceil$ to denote the ceiling of $x$, that is, $x$ rounded
up to the nearest integer. We often abbreviate $x_1, \ldots, x_n$ to $x^n$. Let $\mathcal{X}$ be a finite or
countable set. A code for $\mathcal{X}$ is defined as a 1-to-1 mapping from $\mathcal{X}$ to $\bigcup_{n \geq 1} \{0,1\}^n$,
the set of binary strings (sequences of 0s and 1s) of length 1 or larger.
For a given code $C$, we use $C(x)$ to denote the encoding of $x$. Every code $C$ induces
a function $L_C : \mathcal{X} \to \mathbb{N}$ called the code length function. $L_C(x)$ is the number of bits
(symbols) needed to encode $x$ using code $C$.

Our definition of code implies that we only consider lossless encoding in MDL*: for
any description $z$ it is always possible to retrieve the unique $x$ that gave rise to it. More
precisely, because the code $C$ must be 1-to-1, there is at most one $x$ with $C(x) = z$.
Then $x = C^{-1}(z)$, where the inverse $C^{-1}$ of $C$ is sometimes called a ‘decoding
function’ or ‘description method’.

Preliminaries and notational conventions - probability Let $P$ be a probability distribution
defined on a finite or countable set $\mathcal{X}$. We use $P(x)$ to denote the probability
of $x$, and we denote the corresponding random variable by $X$. If $P$ is a function on
finite or countable $\mathcal{X}$ such that $\sum_x P(x) < 1$, we call $P$ a defective distribution. A
defective distribution may be thought of as a probability distribution that puts some of
its mass on an imagined outcome that in reality will never appear.

A probabilistic source $P$ is a sequence of probability distributions $P^{(1)}, P^{(2)}, \ldots$ on
$\mathcal{X}^1, \mathcal{X}^2, \ldots$ such that for all $n$, $P^{(n)}$ and $P^{(n+1)}$ are compatible: $P^{(n)}$ is equal to the
‘marginal’ distribution of $P^{(n+1)}$ restricted to $n$ outcomes. That is, for all $x^n \in \mathcal{X}^n$,
$P^{(n)}(x^n) = \sum_{y \in \mathcal{X}} P^{(n+1)}(x^n, y)$. Whenever this cannot cause any confusion, we
write $P(x^n)$ rather than $P^{(n)}(x^n)$. A probabilistic source may be thought of as a probability
distribution on infinite sequences†. We say that the data are i.i.d. (independently
and identically distributed) under source $P$ if for each $n$, $x^n \in \mathcal{X}^n$,
$P(x^n) = \prod_{i=1}^{n} P(x_i)$.

* However, see section 4.8.4.

† Working directly with distributions on infinite sequences is more elegant, but it requires
measure theory, which we want to avoid here.

4.1.1 Prefix codes

In MDL we only work with a subset of all possible codes, the so-called prefix codes.
A prefix code* is a code such that no code word is a prefix of any other code word.
For example, let $\mathcal{X} = \{a, b, c\}$. Then the code $C_1$ defined by $C_1(a) = 0$, $C_1(b) = 10$,
$C_1(c) = 11$ is prefix. The code $C_2$ with $C_2(a) = 0$, $C_2(b) = 10$ and $C_2(c) = 01$,
while allowing for lossless decoding, is not a prefix code since 0 is a prefix of 01. The
prefix requirement is natural, and nearly ubiquitous in the data compression literature.
We now explain why this is the case.

Example  4.1 

Suppose we plan to encode a sequence of symbols $(x_1, \ldots, x_n) \in \mathcal{X}^n$. We already designed
a code $C$ for the elements in $\mathcal{X}$. The natural thing to do is to encode $(x_1, \ldots, x_n)$
by the concatenated string $C(x_1)C(x_2) \cdots C(x_n)$. In order for this method to succeed
for all $n$, all $(x_1, \ldots, x_n) \in \mathcal{X}^n$, the resulting procedure must define a code, i.e. the
function $C^{(n)}$ mapping $(x_1, \ldots, x_n)$ to $C(x_1)C(x_2) \cdots C(x_n)$ must be invertible. If
it were not, we would have to use some marker such as a comma to separate the code
words. We would then really be using a ternary rather than a binary alphabet.

Since we always want to construct codes for sequences rather than single symbols, we
only allow codes $C$ such that the extension $C^{(n)}$ defines a code for all $n$. We say that
such codes have ‘uniquely decodable extensions’. It is easy to see that (a) every prefix
code has uniquely decodable extensions. Conversely, although this is not at all easy to
see, it turns out that (b), for every code $C$ with uniquely decodable extensions, there
exists a prefix code $C_0$ such that for all $n$, $x^n \in \mathcal{X}^n$, $L_{C^{(n)}}(x^n) = L_{C_0^{(n)}}(x^n)$ [20].
Since in MDL we are only interested in code lengths, and never in actual codes, we
can restrict ourselves to prefix codes without loss of generality.

Thus, the restriction to prefix codes may also be understood as a means to send concatenated
messages while avoiding the need to introduce extra symbols into the alphabet.

◊
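The role of the prefix property in example 4.1 can be tried out on the codes $C_1$ and $C_2$ from the text. The following is a minimal sketch, not part of the original development; the function names `is_prefix_free`, `encode` and `decode` are our own choices for illustration.

```python
from itertools import combinations

def is_prefix_free(code):
    """Return True if no code word is a prefix of another code word."""
    words = list(code.values())
    return not any(
        a.startswith(b) or b.startswith(a) for a, b in combinations(words, 2)
    )

def encode(code, symbols):
    """Concatenate the code words of the symbols, without any separator."""
    return "".join(code[s] for s in symbols)

def decode(code, bits):
    """Decode a concatenation of code words, reading left to right.

    This greedy procedure is only guaranteed to be correct for prefix codes:
    there, the first code word that matches is the only one that can match.
    """
    inverse = {word: symbol for symbol, word in code.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bit string is not a concatenation of code words")
    return symbols

C1 = {"a": "0", "b": "10", "c": "11"}  # the prefix code C_1 from the text
C2 = {"a": "0", "b": "10", "c": "01"}  # C_2: "0" is a prefix of "01"

print(is_prefix_free(C1), is_prefix_free(C2))
print(decode(C1, encode(C1, "abcab")))
# Under C_2 two different sequences share one concatenated encoding:
print(encode(C2, "ab"), encode(C2, "ca"))
```

With $C_2$, the sequences $ab$ and $ca$ are both mapped to 010, so the extension of $C_2$ to sequences is not invertible, which is exactly the situation the prefix requirement rules out.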


Whenever in the sequel we speak of ‘code’, we really mean ‘prefix code’. We call a
prefix code $C$ for a set $\mathcal{X}$ complete if there exists no other prefix code that compresses
at least one $x$ more and no $x$ less than $C$, i.e. if there exists no code $C'$ such that for all
$x$, $L_{C'}(x) \leq L_C(x)$ with strict inequality for at least one $x$.


4.1.2  The  Kraft  inequality  -  code  lengths  and  probabilities  I 

In this subsection we relate prefix codes to probability distributions. Essential for understanding
the relation is the fact that no matter what code we use, most sequences
cannot be compressed, as demonstrated by the following example:

Example  4.2  (Compression  and  small  subsets:  example  3.2  continued) 

*  Also  known  as  instantaneous  codes  and  called,  perhaps  more  justifiably,  ‘prefix-free’  codes 
in  [71]. 

In example 3.2 we featured the following three sequences:

00010001000100010001 ... 0001000100010001000100010001,  (4.1)

01110100110100100110 ... 1010111010111011000101100010,  (4.2)

00011000001010100000 ... 0010001000010000001000110000.  (4.3)

We showed that (a) the first sequence - an $n$-fold repetition of 0001 - could be substantially
compressed if we use as our code a general purpose programming language
(assuming that valid programs must end with a halt-statement or a closing bracket,
such codes satisfy the prefix property). We also claimed that (b) the second sequence,
$n$ independent outcomes of fair coin tosses, cannot be compressed, and that (c) the
third sequence could be compressed to $\alpha n$ bits, with $0 < \alpha < 1$. We are now in a
position to prove statement (b): strings which are ‘intuitively’ random cannot be substantially
compressed. Let us take some arbitrary but fixed description method over
the data alphabet consisting of the set of all binary sequences of length $n$. Such a code
maps binary strings to binary strings. There are $2^n$ possible data sequences of length $n$.
Only two of these can be mapped to a description of length 1 (since there are only two
binary strings of length 1: ‘0’ and ‘1’). Similarly, only a subset of at most $2^m$ sequences
can have a description of length $m$. This means that at most $\sum_{i=1}^{m} 2^i < 2^{m+1}$ data
sequences can have a description length $\leq m$. The fraction of data sequences of length
$n$ that can be compressed by more than $k$ bits is therefore at most $2^{-k}$ and as such decreases
exponentially in $k$. If data are generated by $n$ tosses of a fair coin, then all $2^n$
possibilities for the data are equally probable, so the probability that we can compress
the data by more than $k$ bits is smaller than $2^{-k}$. For example, the probability that we
can compress the data by more than 20 bits is smaller than one in a million.

We note that after the data (4.2) has been observed, it is always possible to design a
code which uses arbitrarily few bits to encode this data - the actually observed sequence
may be encoded as ‘1’ for example, and no other sequence is assigned a code word. The
point is that with a code that has been designed before seeing any data, it is virtually
impossible to substantially compress randomly generated data. ◊
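The counting argument of example 4.2 is easy to check numerically. The sketch below uses exact rational arithmetic; the function name `max_fraction_compressible` is hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

def max_fraction_compressible(n, k):
    """Upper bound on the fraction of binary strings of length n that a
    1-to-1 code can compress by more than k bits, i.e. map to a
    description of length at most n - k - 1."""
    # There are 2^1 + 2^2 + ... + 2^(n-k-1) binary strings of length 1 .. n-k-1,
    # and each of them can serve as the description of at most one sequence.
    short_descriptions = sum(2**m for m in range(1, n - k))
    return Fraction(short_descriptions, 2**n)

bound = max_fraction_compressible(100, 20)
print(bound < Fraction(1, 2**20))   # the bound stays below 2^-k
print(bound < Fraction(1, 10**6))   # and hence below one in a million
```

The bound depends on $n$ only through the number of available short descriptions, and it never exceeds $2^{-k}$ whatever the value of $n$.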


The  example  demonstrates  that  achieving  a  short  description  length  for  the  data  is 
equivalent  to  identifying  the  data  as  belonging  to  a  tiny,  very  special  subset  out  of  all 
a  priori  possible  data  sequences. 

The most important observation Let $\mathcal{Z}$ be finite or countable. For concreteness, we
may take $\mathcal{Z} = \{0,1\}^n$ for some large $n$, say $n = 10000$. From example 4.2 we know
that, no matter what code we use to encode values in $\mathcal{Z}$, ‘most’ outcomes in $\mathcal{Z}$ will
not be substantially compressible: at most two outcomes can have description length
$1 = -\log \frac{1}{2}$; at most four outcomes can have length $2 = -\log \frac{1}{4}$, and so on. Now
consider any probability distribution on $\mathcal{Z}$. Since the probabilities $P(z)$ must sum up
to one ($\sum_z P(z) = 1$), ‘most’ outcomes in $\mathcal{Z}$ must have small probability in the following
sense: at most 2 outcomes can have probability $\geq \frac{1}{2}$; at most 4 outcomes can
have probability $\geq \frac{1}{4}$; at most 8 can have $\geq \frac{1}{8}$-th etc. This suggests an analogy between
codes and probability distributions: each code induces a code length function that assigns
a number to each $z$, where most $z$’s are assigned large numbers. Similarly, each
distribution assigns a number to each $z$, where most $z$’s are assigned small numbers.

Observation 4.1 (Probability mass functions correspond to code length functions)

Let $\mathcal{Z}$ be a finite or countable set and let $P$ be a probability distribution on $\mathcal{Z}$. Then
there exists a prefix code $C$ for $\mathcal{Z}$ such that for all $z \in \mathcal{Z}$, $L_C(z) = \lceil -\log P(z) \rceil$. $C$
is called the code corresponding to $P$.

Similarly, let $C'$ be a prefix code for $\mathcal{Z}$. Then there exists a (possibly defective) probability
distribution $P'$ such that for all $z \in \mathcal{Z}$, $-\log P'(z) = L_{C'}(z)$. $P'$ is called the
probability distribution corresponding to $C'$.

Moreover, $C'$ is a complete prefix code if $P'$ is proper ($\sum_z P'(z) = 1$).

Thus, large probability according to $P$ means small code length according to the code
corresponding to $P$ and vice versa. We are typically concerned with cases where $\mathcal{Z}$
represents sequences of $n$ outcomes; that is, $\mathcal{Z} = \mathcal{X}^n$ ($n \geq 1$) where $\mathcal{X}$ is the sample
space for one observation.


It  turns  out  that  this  correspondence  can  be  made  mathematically  precise  by  means  of 
the  Kraft  inequality  [20].  We  neither  precisely  state  nor  prove  this  inequality;  rather, 
in  observation  4.1  we  state  an  immediate  and  fundamental  consequence:  probability 
mass  functions  correspond  to  code  length  functions.  The  following  example  illustrates 
this  and  at  the  same  time  introduces  a  type  of  code  that  will  be  frequently  employed  in 
the  sequel: 

Example  4.3  (Uniform  distribution  corresponds  to  fixed-length  code) 

Suppose $\mathcal{Z}$ has $M$ elements. The uniform distribution $P_U$ assigns probability $1/M$ to
each element. We can arrive at a code corresponding to $P_U$ as follows. First, order and
number the elements in $\mathcal{Z}$ as $0, 1, \ldots, M-1$. Then, for each $z$ with number $j$, set $C(z)$
to be equal to $j$ represented as a binary number with $\lceil \log M \rceil$ bits. The resulting code
has, for all $z \in \mathcal{Z}$, $L_C(z) = \lceil \log M \rceil = \lceil -\log P_U(z) \rceil$. This is a code corresponding
to $P_U$ (observation 4.1). In general, there exist several codes corresponding to $P_U$, one
for each ordering of $\mathcal{Z}$. But all these codes share the same length function $L_U(z) :=
\lceil -\log P_U(z) \rceil$; therefore, $L_U(z)$ is the unique code length function corresponding to
$P_U$.

For example, if $M = 4$, $\mathcal{Z} = \{a, b, c, d\}$, we can take $C(a) = 00$, $C(b) = 01$, $C(c) =
10$, $C(d) = 11$ and then $L_U(z) = 2$ for all $z \in \mathcal{Z}$. In general, codes corresponding to
uniform distributions assign fixed lengths to each $z$ and are called fixed-length codes.
To map a non-uniform distribution to a corresponding code, we have to use a more
intricate construction [20]. ◊
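To make the correspondence of observation 4.1 concrete, the sketch below computes the lengths $\lceil -\log P(z) \rceil$ and then assigns code words in order of increasing length. This is one standard way of turning lengths that satisfy the Kraft inequality into a prefix code; we do not claim it is the construction referred to in [20], and the names `shannon_length`, `prefix_code_from_lengths` and `code_for` are our own.

```python
from fractions import Fraction

def shannon_length(p):
    """Smallest integer l with 2**-l <= p, i.e. ceil(-log2 p), computed exactly."""
    if p <= 0:
        raise ValueError("outcomes of probability 0 get no code word")
    length = 0
    while Fraction(1, 2**length) > p:
        length += 1
    return length

def prefix_code_from_lengths(lengths):
    """Build a prefix code with the given word lengths (dict: outcome -> length)."""
    if sum(Fraction(1, 2**l) for l in lengths.values()) > 1:
        raise ValueError("lengths violate the Kraft inequality")
    code, value, previous = {}, 0, 0
    # Visit outcomes from short to long code words, counting upwards in binary.
    for outcome, length in sorted(lengths.items(), key=lambda item: item[1]):
        value <<= length - previous  # pad the counter to the new word length
        code[outcome] = format(value, "b").zfill(length) if length > 0 else ""
        value += 1                   # next unused word of this length
        previous = length
    return code

def code_for(distribution):
    """Prefix code with lengths ceil(-log2 P(z)) for a dict outcome -> P(z)."""
    return prefix_code_from_lengths(
        {z: shannon_length(p) for z, p in distribution.items()}
    )

uniform = {z: Fraction(1, 4) for z in "abcd"}
skewed = {"a": Fraction(1, 2), "b": Fraction(1, 4),
          "c": Fraction(1, 8), "d": Fraction(1, 8)}
print(code_for(uniform))  # the fixed-length code of example 4.3
print(code_for(skewed))   # likelier outcomes receive shorter code words
```

For the uniform distribution on four elements the sketch reproduces the fixed-length code of the example; for the skewed distribution the code word lengths are 1, 2, 3 and 3.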


In practical applications, we almost always deal with probability distributions $P$ and
strings $x^n$ such that $P(x^n)$ decreases exponentially in $n$; for example, this will typically
be the case if data are i.i.d., such that $P(x^n) = \prod_{i=1}^{n} P(x_i)$. Then $-\log P(x^n)$
increases linearly in $n$ and the effect of rounding off $-\log P(x^n)$ becomes negligible,
i.e. $\lceil -\log P(x^n) \rceil \approx -\log P(x^n)$. Note that the code corresponding to the product
distribution of $P$ on $\mathcal{X}^n$ does not have to be the $n$-fold extension of the code for the
original distribution $P$ on $\mathcal{X}$ - if we were to require that, the effect of rounding off
would be on the order of $n$. Instead, we directly design a code for the distribution on
the larger space $\mathcal{Z} = \mathcal{X}^n$. In this way, the effect of rounding changes the code length
by at most 1 bit, which is truly negligible. For this and other§ reasons, we henceforth
simply neglect the integer requirement for code lengths. This simplification allows us
to identify code length functions and (defective) probability mass functions, such that
a short code length corresponds to a high probability and vice versa. Furthermore, as
we will see, in MDL we are not interested in the details of actual encodings $C(z)$;
we only care about the code lengths $L_C(z)$. It is so useful to think about these as
log-probabilities, and so convenient to dispense with the integer restriction, that
we will simply redefine prefix code length functions as (defective) probability mass
functions that can have non-integer code lengths - see observation 4.2.


Observation  4.2  (New  definition  of  code  length  function) 

In MDL we are NEVER concerned with actual encodings; we are only concerned
with code length functions. The set of all code length functions for finite or countable
sample space $\mathcal{Z}$ is defined as:

$$\mathcal{L}_{\mathcal{Z}} = \Big\{ L : \mathcal{Z} \to [0, \infty] \;\Big|\; \sum_{z \in \mathcal{Z}} 2^{-L(z)} \leq 1 \Big\}, \tag{4.4}$$

or equivalently, $\mathcal{L}_{\mathcal{Z}}$ is the set of those functions $L$ on $\mathcal{Z}$ such that there exists a function
$Q$ with $\sum_z Q(z) \leq 1$ and for all $z$, $L(z) = -\log Q(z)$. ($Q(z) = 0$ corresponds to
$L(z) = \infty$.) Again, $\mathcal{Z}$ usually represents a sample of $n$ outcomes: $\mathcal{Z} = \mathcal{X}^n$ ($n \geq 1$)
where $\mathcal{X}$ is the sample space for one observation.


The  following  example  illustrates  idealized  code  length  functions  and  at  the  same  time 
introduces  a  type  of  code  that  will  be  frequently  used  in  the  sequel: 

Example  4.4  (‘Almost’  uniform  code  for  the  positive  integers) 

Suppose we want to encode a number $k \in \{1, 2, \ldots\}$. In example 4.3, we saw that in
order to encode a number between 1 and $M$, we need $\log M$ bits. What if we cannot
determine the maximum $M$ in advance? We cannot just encode $k$ using the uniform
code for $\{1, \ldots, k\}$, since the resulting code would not be prefix. So in general, we will
need more than $\log k$ bits. Yet there exists a prefix-free code which performs ‘almost’
as well as $\log k$. The simplest of such codes works as follows. $k$ is described by a code
word starting with $\lceil \log k \rceil$ 0s. This is followed by a 1, and then $k$ is encoded using the
uniform code for $\{1, \ldots, 2^{\lceil \log k \rceil}\}$. With this protocol, a decoder can first reconstruct
$\lceil \log k \rceil$ by counting all 0s before the leftmost 1 in the encoding. He then has an upper
bound on $k$ and can use this knowledge to decode $k$ itself. This protocol uses at most
$2 \lceil \log k \rceil + 1$ bits. Working with idealized, non-integer code lengths we can simplify
this to $2 \log k + 1$ bits. To see this, consider the function $P(x) = 2^{-2 \log x - 1}$. An easy
calculation gives

$$\sum_{x \in \{1,2,\ldots\}} P(x) = \sum_{x \in \{1,2,\ldots\}} 2^{-2 \log x - 1} = \frac{1}{2} \sum_{x \in \{1,2,\ldots\}} x^{-2} < \frac{1}{2} + \frac{1}{2} \sum_{x \in \{2,3,\ldots\}} \frac{1}{x(x-1)} = 1,$$

§  For  example,  with  non-integer  code  lengths  the  notion  of  ‘code’  becomes  invariant  to  the 
size  of  the  alphabet  in  which  we  describe  data. 

so that $P$ is a (defective) probability distribution. Thus, by our new definition (observation
4.2), there exists a prefix code with, for all $k$, $L(k) = -\log P(k) = 2 \log k + 1$.
We call the resulting code the ‘simple standard code for the integers’. In section 4.4
we will see that it is an instance of a so-called ‘universal’ code.

The idea can be refined to lead to codes with lengths $\log k + O(\log \log k)$; the ‘best’
possible refinement, with code lengths $L(k)$ increasing monotonically but as slowly as
possible in $k$, is known as ‘the universal code for the integers’ [89]. However, for our
purposes in this chapter, it is good enough to encode integers $k$ with $2 \log k + 1$ bits. ◊
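The protocol of example 4.4 can be written out in a few lines. The sketch below is one possible reading of it: we assume that the uniform code for $\{1, \ldots, 2^{\lceil \log k \rceil}\}$ represents $k$ by the binary expansion of $k - 1$, padded to $\lceil \log k \rceil$ bits. The function names are hypothetical.

```python
def ceil_log2(k):
    """Exact value of ceil(log2 k) for an integer k >= 1."""
    return (k - 1).bit_length()

def encode_integer(k):
    """Code word for k: ceil(log2 k) 0s, then a 1, then k in a uniform code."""
    if k < 1:
        raise ValueError("only positive integers can be encoded")
    m = ceil_log2(k)
    # Assumption: the uniform code writes k - 1 in binary using exactly m bits.
    body = format(k - 1, "b").zfill(m) if m > 0 else ""
    return "0" * m + "1" + body

def decode_integer(bits):
    """Read one code word from the front of bits; return (k, remaining bits)."""
    m = bits.index("1")  # the number of leading 0s equals ceil(log2 k)
    body = bits[m + 1 : 2 * m + 1]
    if len(body) < m:
        raise ValueError("truncated code word")
    k = (int(body, 2) if m > 0 else 0) + 1
    return k, bits[2 * m + 1 :]

for k in (1, 2, 5, 1000):
    word = encode_integer(k)
    print(k, word, len(word), decode_integer(word)[0])
```

Because the decoder learns the length of the remaining part from the leading 0s, several integers can be concatenated and read back one at a time, which is the prefix property in action. In this sketch every code word has exactly $2 \lceil \log k \rceil + 1$ bits.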


Example  4.5  (Example  3.2  and  4.2,  continued) 

We are now also in a position to prove the third and final claim of examples 3.2 and
4.2. Consider the three sequences (4.1), (4.2) and (4.3) again. It remains to investigate
how much the third sequence can be compressed. Assume for concreteness that, before
seeing the sequence, we are told that the sequence contains a fraction of 1s equal to
$\frac{1}{5} + \epsilon$ for some small unknown $\epsilon$. By the Kraft inequality, observation 4.1, for all
distributions $P$, there exists some code on sequences of length $n$ such that for all
$x^n \in \mathcal{X}^n$, $L(x^n) = \lceil -\log P(x^n) \rceil$. The fact that the fraction of 1s is approximately
equal to $\frac{1}{5}$ suggests modelling $x^n$ as independent outcomes of a coin with bias $\frac{1}{5}$-th. The
corresponding distribution $P_0$ satisfies

$$-\log P_0(x^n) = -\log \left(\tfrac{1}{5}\right)^{n_{[1]}} \left(\tfrac{4}{5}\right)^{n_{[0]}} = n \left( -\left(\tfrac{1}{5} + \epsilon\right) \log \tfrac{1}{5} - \left(\tfrac{4}{5} - \epsilon\right) \log \tfrac{4}{5} \right) = n \left( \log 5 - \tfrac{8}{5} + 2\epsilon \right),$$

where $n_{[j]}$ denotes the number of occurrences of symbol $j$ in $x^n$. For small enough
$\epsilon$, the part between brackets is smaller than 1, so that, using the code $L_0$ with lengths
$-\log P_0$, the sequence can be encoded using $\alpha n$ bits where $\alpha$ satisfies $0 < \alpha < 1$.
Thus, using the code $L_0$, the sequence can be compressed by a linear amount, if we
use a specially designed code that assigns short code lengths to sequences with about
four times as many 0s as 1s.

We note that after the data (4.3) has been observed, it is always possible to design a
code which uses arbitrarily few bits to encode $x^n$ - the actually observed sequence
may be encoded as ‘1’ for example, and no other sequence is assigned a code word.
The point is that with a code that has been designed before seeing the actual sequence,
given only the knowledge that the sequence will contain approximately four times as
many 0s as 1s, the sequence is guaranteed to be compressed by an amount linear in
$n$. ◊
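The code length computed in example 4.5 can be checked numerically. The sketch below evaluates the idealized code length $-\log P_0(x^n)$ directly from the symbol counts and compares it with the closed form $n(\log 5 - \frac{8}{5} + 2\epsilon)$; the sample size and counts are made-up values used only for illustration.

```python
import math

def bernoulli_code_length(n_ones, n_zeros, p_one):
    """Idealized code length -log2 P(x^n), in bits, for an i.i.d. coin
    that produces a 1 with probability p_one."""
    return -(n_ones * math.log2(p_one) + n_zeros * math.log2(1 - p_one))

n = 10_000                  # hypothetical sample size
n_ones = 2_100              # hypothetical count of 1s: a fraction 1/5 + epsilon
epsilon = n_ones / n - 1 / 5

length = bernoulli_code_length(n_ones, n - n_ones, 1 / 5)
formula = n * (math.log2(5) - 8 / 5 + 2 * epsilon)

print(length, formula)      # the two values agree up to rounding error
print(length / n)           # bits per outcome, i.e. the factor alpha
```

For these values the number of bits per outcome stays below 1, so the sequence is compressed by an amount linear in $n$ compared with the $n$ bits of the literal encoding.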

Continuous sample spaces How does the correspondence work for continuous-valued
$\mathcal{X}$? In this chapter we only consider $P$ on $\mathcal{X}$ such that $P$ admits a density¶. Whenever
in the following we make a general statement about sample spaces $\mathcal{X}$ and distributions
$P$, $\mathcal{X}$ may be finite, countable or any subset of $\mathbb{R}^l$, for any integer $l \geq 1$, and $P(x)$
represents the probability mass function or density of $P$, as the case may be. In the continuous
case, all sums should be read as integrals. The correspondence between probability
distributions and codes may be extended to distributions on continuous-valued
$\mathcal{X}$: we may think of $L(x^n) := -\log P(x^n)$ as a code length function corresponding
to $\mathcal{Z} = \mathcal{X}^n$ encoding the values in $\mathcal{X}^n$ at unit precision; here $P(x^n)$ is the density of
$x^n$ according to $P$. We refer to [20] for further details.

¶ As understood in elementary probability, i.e. with respect to the Lebesgue measure.

4.1.3 The information inequality - code lengths and probabilities II

In  the  previous  subsection,  we  established  the  first  fundamental  relation  between  prob¬ 
ability  distributions  and  code  length  functions.  We  now  discuss  the  second  relation, 
which  is  nearly  as  important. 

In the correspondence to code length functions, probability distributions were treated
as mathematical objects and nothing else. That is, if we decide to use a code $C$ to
encode our data, this does not necessarily mean that we assume our data to
be drawn according to the probability distribution corresponding to $C$: we may have no
idea what distribution generates our data; or conceivably, such a distribution may not
even exist‖. Nevertheless, if the data are distributed according to some distribution $P$,
then the code corresponding to $P$ turns out to be the optimal code to use, in an expected
sense - see observation 4.3. This result may be recast as follows: for all distributions
$P$ and $Q$ with $Q \neq P$,

$$E_P[-\log Q(X)] > E_P[-\log P(X)].$$

In  this  form,  the  result  is  known  as  the  information  inequality.  It  is  easily  proved  using 
concavity  of  the  logarithm  [20]. 
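For the discrete case, the argument can be sketched in a few lines. The derivation below is our own summary of the standard proof via Jensen's inequality, not a quotation from [20]; it assumes that $Q$ is a (possibly defective) distribution and that the expectations involved are finite.

```latex
\begin{align*}
E_P[-\log Q(X)] - E_P[-\log P(X)]
  &= -\sum_{x : P(x) > 0} P(x) \log \frac{Q(x)}{P(x)} \\
  &\geq -\log \sum_{x : P(x) > 0} P(x) \, \frac{Q(x)}{P(x)}
     && \text{(Jensen, since $\log$ is concave)} \\
  &= -\log \sum_{x : P(x) > 0} Q(x) \;\geq\; -\log 1 \;=\; 0 .
\end{align*}
```

Because the logarithm is strictly concave, the first inequality is an equality only if $Q(x)/P(x)$ is constant on the support of $P$, and the second only if $Q$ puts all its mass on that support. Both can hold together only if $Q = P$, which gives the strict inequality for $Q \neq P$.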


Observation  4.3  (The  P  that  corresponds  to  L  minimizes  expected  code  length) 

Let $P$ be a distribution on (finite, countable or continuous-valued) $\mathcal{Z}$ and let $\tilde{L}$ be
defined by

$$\tilde{L} := \arg\min_{L \in \mathcal{L}_{\mathcal{Z}}} E_P[L(Z)]. \tag{4.5}$$

Then $\tilde{L}$ exists, is unique, and is identical to the code length function corresponding to
$P$, with lengths $\tilde{L}(z) = -\log P(z)$.


The information inequality says the following: suppose $Z$ is distributed according to
$P$ (‘generated by $P$’). Then, among all possible codes for $Z$, the code with lengths
$-\log P(Z)$ ‘on average’ gives the shortest encodings of outcomes of $P$. Why should
we  be  interested  in  the  average?  The  law  of  large  numbers  [30]  implies  that,  for  large 
samples  of  data  distributed  according  to  P,  with  high  P-probability,  the  code  that  gives 
the  shortest  expected  lengths  will  also  give  the  shortest  actual  code  lengths,  which  is 
what  we  are  really  interested  in.  This  will  hold  if  data  are  i.i.d.,  but  also  more  generally 
if  P  defines  a  ‘stationary  and  ergodic’  process. 

Example  4.6 

Let us briefly illustrate this. Let $P^*$, $Q_A$ and $Q_B$ be three probability distributions on
$\mathcal{X}$, extended to $\mathcal{Z} = \mathcal{X}^n$ by independence. Hence $P^*(x^n) = \prod_{i=1}^{n} P^*(x_i)$ and similarly
for $Q_A$ and $Q_B$. Suppose we obtain a sample generated by $P^*$. A and B both want
to encode the sample using as few bits as possible, but neither knows that $P^*$ has
actually been used to generate the sample. A decides to use the code corresponding
to distribution $Q_A$ and B decides to use the code corresponding to $Q_B$. Suppose that
$E_{P^*}[-\log Q_A(X)] < E_{P^*}[-\log Q_B(X)]$. Then, by the law of large numbers, with
$P^*$-probability 1, $n^{-1}(-\log Q_j(X_1, \ldots, X_n)) \to E_{P^*}[-\log Q_j(X)]$, for both $j \in
\{A, B\}$ (note $-\log Q_j(X^n) = -\sum_{i=1}^{n} \log Q_j(X_i)$). It follows that, with probability
1, A will need fewer (linearly in $n$) bits to encode $X_1, \ldots, X_n$ than B. ◊

‖ Even if one adopts a Bayesian stance and postulates that an agent can come up with a
(subjective) distribution for every conceivable domain, this problem remains: in practice, the
adopted distribution may be so complicated that we cannot design the optimal code corresponding
to it, and have to use some ad hoc one instead.

The qualitative content of this result is not so surprising: in a large sample generated by
$P$, the frequency of each $x \in \mathcal{X}$ will be approximately equal to the probability $P(x)$.
In order to obtain a short code length for $x^n$, we should use a code that assigns a small
code length to those symbols in $\mathcal{X}$ with high frequency (probability), and a large code
length to those symbols in $\mathcal{X}$ with low frequency (probability).
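The situation of example 4.6 can be simulated directly. In the sketch below, the three distributions, the sample size and the random seed are arbitrary choices made for illustration; only the qualitative ordering of the code lengths matters.

```python
import math
import random

def expected_code_length(p, q):
    """E_P[-log2 Q(X)] in bits per outcome; p and q map outcomes to probabilities."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

def total_code_length(q, sample):
    """Idealized code length -log2 Q(x^n) of an i.i.d. sample, in bits."""
    return -sum(math.log2(q[x]) for x in sample)

p_star = {0: 0.8, 1: 0.2}   # hypothetical data-generating distribution P*
q_a = {0: 0.7, 1: 0.3}      # hypothetical distribution used by A
q_b = {0: 0.5, 1: 0.5}      # hypothetical distribution used by B

# Expected code lengths per outcome: P* itself is best, then Q_A, then Q_B.
print(expected_code_length(p_star, p_star),
      expected_code_length(p_star, q_a),
      expected_code_length(p_star, q_b))

# Actual code lengths on a sample drawn from P*.
rng = random.Random(0)
sample = [1 if rng.random() < p_star[1] else 0 for _ in range(5000)]
bits_a = total_code_length(q_a, sample)
bits_b = total_code_length(q_b, sample)
print(bits_a, bits_b)
```

Here $Q_A$ is closer to $P^*$ than $Q_B$ in terms of expected code length, and on the simulated sample A indeed needs fewer bits than B, with a gap that grows linearly in the sample size.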

Summary  In  this  section  we  introduced  (prefix)  codes  and  thoroughly  discussed  the 
relation  between  probabilities  and  code  lengths.  We  are  now  almost  ready  to  formalize 
a  simple  version  of  MDL  -  but  first  we  need  to  review  some  concepts  of  statistics. 


4.2  Statistical  preliminaries  and  example  models 

In  the  next  section,  we  formally  introduce  the  crude  form  of  MDL.  We  will  freely 
use  some  convenient  statistical  concepts  which  we  review  in  this  section;  for  details 
see,  for  example,  [15].  We  also  describe  the  model  class  of  Markov  chains  of  arbitrary 
order,  which  we  use  as  our  running  example.  These  admit  a  simpler  treatment  than  the 
polynomials,  to  which  we  return  in  section  4.7. 

Statistical preliminaries A probabilistic model* $\mathcal{M}$ is a set of probabilistic sources.
Typically one uses the word ‘model’ to denote sources of the same functional form.
We often index the elements $P$ of a model $\mathcal{M}$ using some parameter $\theta$. In that case
we write $P$ as $P(\cdot \mid \theta)$, and $\mathcal{M}$ as $\mathcal{M} = \{P(\cdot \mid \theta) \mid \theta \in \Theta\}$, for some parameter space
$\Theta$. If $\mathcal{M}$ can be parameterized by some connected $\Theta \subseteq \mathbb{R}^k$ for some $k \geq 1$ and
the mapping $\theta \to P(\cdot \mid \theta)$ is smooth (appropriately defined), we call $\mathcal{M}$ a parametric
model or family. For example, the model $\mathcal{M}$ of all normal distributions on $\mathcal{X} = \mathbb{R}$ is a
parametric model that can be parameterized by $\theta = (\mu, \sigma^2)$ where $\mu$ is the mean and
$\sigma^2$ is the variance of the distribution indexed by $\theta$. The family of all Markov chains of
all orders is a model, but not a parametric model. We call a model $\mathcal{M}$ an i.i.d. model
if, according to all $P \in \mathcal{M}$, $X_1, X_2, \ldots$ are i.i.d. We call $\mathcal{M}$ $k$-dimensional if $k$ is the
smallest integer $k$ so that $\mathcal{M}$ can be smoothly parameterized by some $\Theta \subseteq \mathbb{R}^k$.

For a given model $\mathcal{M}$ and sample $D = x^n$, the maximum likelihood (ML) $\hat{P}$ is the
$P \in \mathcal{M}$ maximizing $P(x^n)$. For a parametric model with parameter space $\Theta$, the
maximum likelihood estimator $\hat{\theta}$ is the function that, for each $n$, maps $x^n$ to the $\theta \in \Theta$
that maximizes the likelihood $P(x^n \mid \theta)$. The ML estimator may be viewed as a ‘learning
algorithm’. This is a procedure that, given as input a sample $x^n$ of arbitrary
length, outputs a parameter or hypothesis $P_n \in \mathcal{M}$. We say that a learning algorithm
is consistent relative to distance measure $d$ if for all $P^* \in \mathcal{M}$, if data are distributed
according to $P^*$, then the output $P_n$ converges to $P^*$ in the sense that $d(P^*, P_n) \to 0$
with $P^*$-probability 1. Thus, if $P^*$ is the ‘true’ state of nature, then given enough data,
the learning algorithm will learn a good approximation of $P^*$ with very high probability.

* Henceforth, we simply use ‘model’ to denote probabilistic models; we typically use $\mathcal{H}$ to
denote sets of hypotheses such as polynomials, and reserve $\mathcal{M}$ for probabilistic models.

Example  4.7  (Markov  and  Bernoulli  models) 

Recall that a $k$-th order Markov chain on $\mathcal{X} = \{0, 1\}$ is a probabilistic source such that
for every $n > k$,

$$P(X_n = 1 \mid X_{n-1} = x_{n-1}, \ldots, X_{n-k} = x_{n-k}) = P(X_n = 1 \mid X_{n-1} = x_{n-1}, \ldots, X_{n-k} = x_{n-k}, \ldots, X_1 = x_1). \tag{4.6}$$

That is, the probability distribution on $X_n$ depends only on the $k$ symbols preceding $n$.
Thus, there are $2^k$ possible distributions of $X_n$, and each such distribution is identified
with a state of the Markov chain. To fully identify the chain, we also need to specify the
starting state, defining the first $k$ outcomes $X_1, \ldots, X_k$. The $k$-th order Markov model
is the set of all $k$-th order Markov chains, i.e., all sources satisfying (4.6) equipped
with a starting state.

The special case of the 0-th order Markov model is the Bernoulli or biased coin model,
which we denote by $\mathcal{B}^{(0)}$. We can parameterize the Bernoulli model by a parameter
$\theta \in [0, 1]$ representing the probability of observing a 1. Thus,
$\mathcal{B}^{(0)} = \{P(\cdot \mid \theta) \mid \theta \in [0, 1]\}$, with $P(x^n \mid \theta)$ by definition equal to

$$P(x^n \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta) = \theta^{n_{[1]}} (1 - \theta)^{n_{[0]}},$$

where $n_{[1]}$ stands for the number of 1s, and $n_{[0]}$ for the number of 0s in the sample.
Note that the Bernoulli model is i.i.d. and that $n_{[1]} + n_{[0]} = n$. The log-likelihood is
given by

$$\log P(x^n \mid \theta) = n_{[1]} \log \theta + n_{[0]} \log(1 - \theta). \qquad (4.7)$$

Taking the derivative of (4.7) with respect to $\theta$, we see that for fixed $x^n$, the
log-likelihood is maximized by setting the probability of 1 equal to the observed frequency.
Since the logarithm is a monotonically increasing function, the likelihood is maximized
at the same value: the ML estimator is given by $\hat\theta(x^n) = n_{[1]}/n$.

Similarly, the first-order Markov model $\mathcal{B}^{(1)}$ can be parameterized by a vector
$\theta = (\theta_{[1|0]}, \theta_{[1|1]}) \in [0, 1]^2$ together with a starting state in $\{0, 1\}$. Here
$\theta_{[1|j]}$ represents the probability of observing a 1 following the symbol $j$. The
log-likelihood is given by

$$\log P(x^n \mid \theta) = n_{[1|1]} \log \theta_{[1|1]} + n_{[0|1]} \log(1 - \theta_{[1|1]}) + n_{[1|0]} \log \theta_{[1|0]} + n_{[0|0]} \log(1 - \theta_{[1|0]}),$$

with $n_{[i|j]}$ denoting the number of times outcome $i$ is observed in state (previous outcome)
$j$. This is maximized by setting $\hat\theta = (\hat\theta_{[1|0]}, \hat\theta_{[1|1]})$, with
$\hat\theta_{[1|j]} = n_{[1|j]} / (n_{[0|j]} + n_{[1|j]})$ set to the conditional frequency of 1 preceded by $j$.
In general, a $k$-th order Markov chain has $2^k$ parameters and the corresponding likelihood is
maximized by setting the parameter $\theta_{[i|j]}$ equal to the number of times $i$ was observed
in state $j$ divided by the number of times the chain was in state $j$. □
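
The two estimators of example 4.7 can be sketched in a few lines of Python. This is only an illustration: the function names `bernoulli_ml` and `markov1_ml` and the sample sequence are assumptions made here, not part of the original text. A state that is never visited has no defined conditional frequency, which the sketch signals with `None`.

```python
from collections import Counter

def bernoulli_ml(x):
    """ML estimate for the Bernoulli model: the observed frequency of 1s."""
    return sum(x) / len(x)

def markov1_ml(x):
    """ML estimates (theta[1|0], theta[1|1]) for a first-order Markov chain.

    counts[(j, i)] is the number of times outcome i directly follows outcome j.
    A state that is never visited gets None (the estimate is undefined).
    """
    counts = Counter(zip(x[:-1], x[1:]))
    est = {}
    for j in (0, 1):
        visits = counts[(j, 0)] + counts[(j, 1)]
        est[j] = counts[(j, 1)] / visits if visits else None
    return est[0], est[1]

# Hypothetical sample, chosen only for illustration.
x = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
theta_hat = bernoulli_ml(x)
theta_1_given_0, theta_1_given_1 = markov1_ml(x)
```

For this sample, seven of the ten outcomes are 1s; every 0 is followed by a 1, and four of the six transitions out of state 1 lead to a 1.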
Suppose now we are given data $D = x^n$ and we want to find the Markov chain that
best explains $D$. Since we do not want to restrict ourselves to chains of fixed order, we
run a large risk of overfitting: simply picking, among all Markov chains of each order,
the ML Markov chain that maximizes the probability of the data, we typically end up
with a chain of order $n-1$ with starting state given by the sequence $x_1, \ldots, x_{n-1}$, and
$P(X_n = x_n \mid X^{n-1} = x^{n-1}) = 1$. Such a chain will assign probability 1 to $x^n$. Below
we show that MDL makes a more reasonable choice.


4.3  Crude  MDL 

Based on the information theoretic (section 4.1) and statistical (section 4.2) preliminaries
discussed before, we now formalize a first, crude version of MDL.

Let $\mathcal{M}$ be a class of probabilistic sources (not necessarily Markov chains). Suppose
we observe a sample $D = (x_1, \ldots, x_n) \in \mathcal{X}^n$. Recall the ‘crude’* two-part code MDL
principle from section 3.3:

Crude,  two-part  version  of  MDL  principle 

Let $\mathcal{H}^{(1)}, \mathcal{H}^{(2)}, \ldots$ be a set of candidate models. The best point hypothesis
$H \in \mathcal{H}^{(1)} \cup \mathcal{H}^{(2)} \cup \ldots$ to explain data $D$ is the one which minimizes the sum
$L(H) + L(D \mid H)$, where

• $L(H)$ is the length, in bits, of the description of the hypothesis, and

• $L(D \mid H)$ is the length, in bits, of the description of the data when encoded with
the help of the hypothesis.

The best model to explain $D$ is the smallest model containing the selected $H$.

In this section, we implement this crude MDL principle by giving a precise definition
of the terms $L(H)$ and $L(D \mid H)$. To make the first term precise, we must design a code
$C_1$ for encoding hypotheses $H$ such that $L(H) = L_{C_1}(H)$. For the second term, we
must design a set of codes $C_{2,H}$ (one for each $H \in \mathcal{M}$) such that for all $D \in \mathcal{X}^n$,
$L(D \mid H) = L_{C_{2,H}}(D)$. We start by describing the codes $C_{2,H}$.


4.3.1  Description  length  of  data  given  hypotheses 

Given a sample of size $n$, each hypothesis $H$ may be viewed as a probability distribution
on $\mathcal{X}^n$. We denote the corresponding probability mass function by $P(\cdot \mid H)$. We
need to associate with $P(\cdot \mid H)$ a code, or really, just a code length function for $\mathcal{X}^n$. We
already know that there exists a code with length function $L$ such that for all $x^n \in \mathcal{X}^n$,
$L(x^n) = -\log P(x^n \mid H)$. This is the code that we will pick. It is a natural choice for
two reasons:

1. With this choice, the code length $L(x^n \mid H)$ is equal to minus the log-likelihood
of $x^n$ according to $H$, which is a standard statistical notion of ‘goodness-of-fit’.

* The terminology ‘crude MDL’ is not standard. It is introduced here for pedagogical reasons,
to clarify the importance of having a single, unified principle for designing codes. It should be
noted that Rissanen’s and Barron’s early theoretical papers on MDL already contain such
principles, albeit in a slightly different form than in their recent papers. Early practical applications
[83], [43] often do use ad hoc two-part codes which really are ‘crude’ in the sense defined here.
2. If the data turn out to be distributed according to $P(\cdot \mid H)$, then the code $L(\cdot \mid H)$ will
uniquely minimize the expected code length (section 4.1).

The second item implies that our choice is, in a sense, the only reasonable
choice†. To see this, suppose $\mathcal{M}$ is a finite i.i.d. model containing,
say, $M$ distributions. Suppose we assign an arbitrary but finite
code length $L(H)$ to each $H \in \mathcal{M}$. Suppose $X_1, X_2, \ldots$ are
actually distributed i.i.d. according to some ‘true’ $H^* \in \mathcal{M}$. By the
reasoning of example 4.6, we see that MDL will select the true distribution
$P(\cdot \mid H^*)$ for all large $n$, with probability 1. This means that
MDL is consistent for finite $\mathcal{M}$. If we were to assign codes to distributions
in some other manner not satisfying $L(D \mid H) = -\log P(D \mid H)$,
then there would exist distributions $P(\cdot \mid H)$ such that
$L(D \mid H) \neq -\log P(D \mid H)$. But by observation 4.1, there must be some
distribution $P(\cdot \mid H')$ with $L(\cdot \mid H) = -\log P(\cdot \mid H')$. Now let $\mathcal{M} = \{H, H'\}$
and suppose data are distributed according to $P(\cdot \mid H')$. Then, by the
reasoning of example 4.6, MDL would select $H$ rather than $H'$ for all
large $n$! Thus, MDL would be inconsistent even in this simplest of all
imaginable cases - there would then be no hope for good performance
in the considerably more complex situations we intend to use it for‡.

4.3.2  Description  length  of  hypotheses 

In its weakest and crudest form, the two-part code MDL principle does not give any
guidelines as to how to encode hypotheses (probability distributions). Every code for
encoding hypotheses is allowed, as long as such a code does not change with the sample
size $n$.

To see the danger in allowing codes to depend on $n$, consider the
Markov chain example: if we were allowed to use different codes for
different $n$, we could use, for each $n$, a code assigning a uniform
distribution to all Markov chains of order $n-1$ with all parameters equal
to 0 or 1. Since there are only a finite number ($2^{n-1}$) of these, this is
possible. But then, for each $n$, $x^n \in \mathcal{X}^n$, MDL would select the ML
Markov chain of order $n-1$. Thus, MDL would coincide with ML
and, no matter how large $n$, we would overfit.

Consistency of two-part MDL Remarkably, if we fix an arbitrary code for all hypotheses,
identical for all sample sizes $n$, this is sufficient to make MDL consistent§
for a wide variety of models, including the Markov chains. For example, let $L$ be the
length function corresponding to some code for the Markov chains. Suppose some
Markov chain $P^*$ generates the data such that $L(P^*) < \infty$ under our coding scheme.
Then, loosely speaking, for every $P^*$ of every order, with probability 1 there exists
some $n_0$ such that for all samples larger than $n_0$, two-part MDL will select $P^*$ - here
$n_0$ may depend on $P^*$ and $L$.

While  this  result  indicates  that  MDL  may  be  doing  something  sensible,  it  certainly 
does  not  justify  the  use  of  arbitrary  codes  -  different  codes  will  lead  to  preferences  of 

† But see chapter 6 for more discussion.

‡ See section 3.5 of chapter 3 for a discussion on the role of consistency in MDL.

§  See,  for  example  [8],  [6]. 
different hypotheses, and it is not at all clear how a code should be designed that leads
to good inferences with small, practically relevant sample sizes.

Barron and Cover [8] have developed a precise theory of how to design codes $C_1$
in a ‘clever’ way, anticipating the developments of ‘refined MDL’. Practitioners have
often simply used ‘reasonable’ coding schemes, based on the following idea. Usually
there exists some ‘natural’ decomposition of the models under consideration,
$\mathcal{M} = \bigcup_{k>0} \mathcal{M}^{(k)}$, where the dimension of $\mathcal{M}^{(k)}$ grows with $k$ but is not
necessarily equal to $k$. In the Markov chain example, we have $\mathcal{B} = \bigcup \mathcal{B}^{(k)}$ where
$\mathcal{B}^{(k)}$ is the $k$-th order, $2^k$-parameter Markov model. Then within each submodel
$\mathcal{M}^{(k)}$, we may use a fixed-length code for $\theta \in \Theta^{(k)}$. Since the set $\Theta^{(k)}$ is
typically a continuum, we somehow need to discretize it to achieve this.

Example  4.8  (A  very  crude  code  for  the  Markov  chains) 

We can describe a Markov chain of order $k$ by first describing $k$, and then describing
a parameter vector $\theta \in [0, 1]^{k'}$ with $k' = 2^k$. We describe $k$ using our simple
code for the integers (example 4.4). This takes $2 \log k + 1$ bits. We now have to describe
the $k'$-component parameter vector. We saw in example 4.7 that for any $x^n$,
the best-fitting (ML) $k$-th order Markov chain can be identified with $k'$ frequencies.
It is not hard to see that these frequencies are uniquely determined by the counts
$n_{[1|0\cdots00]}, n_{[1|0\cdots01]}, \ldots, n_{[1|1\cdots11]}$. Each individual count must be in the
$(n+1)$-element set $\{0, 1, \ldots, n\}$. Since we assume $n$ is given in advance*, we may use a
simple fixed-length code to encode this count, taking $\log(n+1)$ bits (example 4.3). Thus,
once $k$ is fixed, we can describe such a Markov chain by a uniform code using $k' \log(n+1)$
bits. With the code just defined we get for any $P \in \mathcal{B}$, indexed by parameter $\theta^{(k)}$,

$$L(P) = L(k, \theta^{(k)}) = 2 \log k + 1 + k' \log(n+1),$$

so that with these codes, MDL tells us to pick the $k$, $\theta^{(k)}$ minimizing

$$L(k, \theta^{(k)}) + L(D \mid k, \theta^{(k)}) = 2 \log k + 1 + k' \log(n+1) - \log P(D \mid k, \theta^{(k)}), \qquad (4.8)$$

where the $\theta^{(k)}$ that is chosen will be equal to the ML estimator for $\mathcal{M}^{(k)}$. □
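
As a rough illustration of how the criterion of example 4.8 could be evaluated in practice, the following Python sketch computes the total two-part length for each candidate order and picks the minimizer. Several choices are assumptions made for this sketch only: all logarithms are base 2, the cost of the order is taken as $2 \log(k+1) + 1$ so that $k = 0$ can be encoded, the first $k$ symbols (the starting state) are not charged, and the function names are hypothetical.

```python
import math
from collections import Counter

def neg_log_ml_likelihood(x, k):
    """-log2 P(D | k, theta_hat): bits needed for x[k:] under the ML k-th order chain.

    The first k symbols act as the starting state and are not charged here.
    """
    counts = Counter((tuple(x[i - k:i]), x[i]) for i in range(k, len(x)))
    visits = Counter()
    for (state, _), c in counts.items():
        visits[state] += c
    return -sum(c * math.log2(c / visits[state]) for (state, _), c in counts.items())

def crude_mdl_length(x, k):
    """Total two-part code length in the spirit of (4.8), in bits."""
    n = len(x)
    order_bits = 2 * math.log2(k + 1) + 1      # assumption: shifted so that k = 0 is codable
    param_bits = (2 ** k) * math.log2(n + 1)   # 2^k counts, each in {0, ..., n}
    return order_bits + param_bits + neg_log_ml_likelihood(x, k)

def select_order(x, max_k):
    """Return the order k in {0, ..., max_k} with the smallest total length."""
    return min(range(max_k + 1), key=lambda k: crude_mdl_length(x, k))
```

On a strictly alternating sequence such as `[0, 1] * 50`, `select_order(x, 4)` returns 1: a first-order chain predicts every symbol with certainty, and higher orders only add parameter cost.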


Why not this code? We may ask two questions about this code. First, why did we only
reserve code words for $\theta$ that are potentially ML estimators for the given data? The
reason is that, given $k' = 2^k$, the code length $L(D \mid k, \theta^{(k)})$ is minimized by
$\hat\theta^{(k)}(D)$, the ML estimator within $\Theta^{(k)}$. Reserving code words for $\theta \in [0, 1]^{k'}$
that cannot be ML estimates would only serve to lengthen $L(D \mid k, \theta^{(k)})$ and can never
shorten $L(k, \theta^{(k)})$. Thus, the total description length needed to encode $D$ will increase.
Since our stated goal is to minimize description lengths, this is undesirable.

* Strictly speaking, the assumption that $n$ is given in advance (i.e., both encoder and decoder
know $n$) contradicts the earlier requirement that the code to be used for encoding hypotheses is
not allowed to depend on $n$. Thus, we should first encode some $n$ explicitly, using $2 \log n + 1$
bits (example 4.4), and then pick the $n$ (typically, but not necessarily equal to the actual sample
size) that allows for the shortest three-part code length of the data (first encode $n$, then $(k, \theta)$,
then the data). In practice this will not significantly alter the chosen hypothesis, except for some
quite special data sequences.
However, by the same logic we may also ask whether we have not reserved too many
code words for $\theta \in [0, 1]^{k'}$. And in fact, it turns out that we have: the distance between
two adjacent ML estimators is $O(1/n)$. Indeed, if we had used a coarser precision, only
reserving code words for parameters with distances $O(1/\sqrt{n})$, we would obtain smaller
code lengths - (4.8) would become

$$L(k, \theta^{(k)}) + L(D \mid k, \theta^{(k)}) = -\log P(D \mid k, \hat\theta^{(k)}) + \frac{k'}{2} \log n + c_k, \qquad (4.9)$$

where $c_k$ is a small constant depending on $k$, but not on $n$ [8]. In section 4.5 we show
that (4.9) is in some sense ‘optimal’.

The good news and the bad news The good news is (1) we have found a principled,
non-arbitrary manner to encode data $D$ given a probability distribution $H$, namely, to
use the code with lengths $-\log P(D \mid H)$; and (2), asymptotically, any code for
hypotheses will lead to a consistent criterion. The bad news is that we have not found
clear guidelines to design codes for hypotheses $H \in \mathcal{M}$. We found some intuitively
reasonable codes for Markov chains, and we then reasoned that these could be somewhat
‘improved’, but what is conspicuously lacking is a sound theoretical principle for
designing and improving codes.

We take the good news to mean that our idea may be worth pursuing further. We take
the bad news to mean that we do have to modify or extend the idea to get a meaningful,
non-arbitrary and practically relevant model selection method. Such an extension was
already suggested in Rissanen’s early works [88], [89] and refined by Barron and Cover
[8]. However, in these works, the principle was still restricted to two-part codes. To get
a fully satisfactory solution, we need to move to ‘universal codes’, of which the two-part
codes are merely a special case.


4.4  Information  theory  II:  universal  codes  and  models 

We have just indicated why the two-part code formulation of MDL needs to be refined.
It turns out that the key concept we need is that of universal coding. Loosely
speaking, a code $\bar L$ that is universal relative to a set of candidate codes $\mathcal{L}$ allows us
to compress every sequence $x^n$ almost as well as the code in $\mathcal{L}$ that compresses that
particular sequence most. Two-part codes are universal (section 4.4.1), but there exist
other universal codes such as the Bayesian mixture code (section 4.4.2) and the normalized
maximum likelihood (NML) code (section 4.4.3). We also discuss universal
models, which are just the probability distributions corresponding to universal codes.
In this section, we are not concerned with learning from data; we only care about
compressing data as much as possible. We connect our findings with learning in section
4.5.

Coding as communication Like many other topics in coding, ‘universal coding’ can
best be explained if we think of descriptions as messages: we can always view a
description as a message that some sender or encoder, say A, sends to some receiver or
decoder, say B. Before sending any messages, A and B meet in person. They agree
on the set of messages that A may send to B. Typically, this will be the set $\mathcal{X}^n$ of
sequences $x_1, \ldots, x_n$, where each $x_i$ is an outcome in the space $\mathcal{X}$. They also agree
upon a (prefix) code that will be used by A to send his messages to B. Once this has
been done, A and B go back to their respective homes and A sends his messages to B
in the form of binary strings. The unique decodability property of prefix codes implies
that, when B receives a message, he should always be able to decode it in a unique
manner.

Universal coding Suppose our encoder/sender is about to observe a sequence $x^n \in \mathcal{X}^n$
which he plans to compress as much as possible. Equivalently, he wants to send
an encoded version of $x^n$ to the receiver using as few bits as possible. Sender and
receiver have a set of candidate codes $\mathcal{L}$ for $\mathcal{X}^n$ available*. They believe or hope that
one of these codes will allow for substantial compression of $x^n$. However, they must
decide on a code for $\mathcal{X}^n$ before sender observes the actual $x^n$, and they do not know
which code in $\mathcal{L}$ will lead to good compression of the actual $x^n$. What is the best
thing they can do? They may be tempted to try the following: upon seeing $x^n$, sender
simply encodes/sends $x^n$ using the $L \in \mathcal{L}$ that minimizes $L(x^n)$ among all $L \in \mathcal{L}$.
But this naive scheme will not work: since decoder/receiver does not know what $x^n$
has been sent before decoding the message, he does not know which of the codes in
$\mathcal{L}$ has been used by sender/encoder. Therefore, decoder cannot decode the message:
the resulting protocol does not constitute a uniquely decodable, let alone a prefix code.
Indeed, as we show below, in general no code $\bar L$ exists such that for all $x^n \in \mathcal{X}^n$,
$\bar L(x^n) \leq \min_{L \in \mathcal{L}} L(x^n)$: in words, there exists no code which, no matter what $x^n$ is,
always mimics the best code for $x^n$.

Example  4.9 

Suppose we think that our sequence can be reasonably well-compressed by a code
corresponding to some biased coin model. For simplicity, we restrict ourselves to a
finite number of such models. Thus, let $\mathcal{L} = \{L_1, \ldots, L_9\}$ where $L_1$ is the code
length function corresponding to the Bernoulli model $P(\cdot \mid \theta)$ with parameter $\theta = 0.1$,
$L_2$ corresponds to $\theta = 0.2$ and so on. From (4.7) we see that, for example,

$$L_8(x^n) = -\log P(x^n \mid 0.8) = -n_{[0]} \log 0.2 - n_{[1]} \log 0.8$$

$$L_9(x^n) = -\log P(x^n \mid 0.9) = -n_{[0]} \log 0.1 - n_{[1]} \log 0.9.$$

Both $L_8(x^n)$ and $L_9(x^n)$ are linear in the number of 1s in $x^n$. However,
if the frequency $n_{[1]}/n$ is approximately 0.8, then $\min_{L \in \mathcal{L}} L(x^n)$ will be achieved for $L_8$.
If $n_{[1]}/n \approx 0.9$ then $\min_{L \in \mathcal{L}} L(x^n)$ is achieved for $L_9$. More generally, if
$n_{[1]}/n \approx j/10$ then $L_j$ achieves the minimum†. We would like to send $x^n$ using a code $\bar L$
such that for all $x^n$, we need at most $\hat L(x^n)$ bits, where $\hat L(x^n)$ is defined as
$\hat L(x^n) := \min_{L \in \mathcal{L}} L(x^n)$. Since $-\log$ is monotonically decreasing,
$\hat L(x^n) = -\log P(x^n \mid \hat\theta(x^n))$. We already gave an informal explanation why a code with
lengths $\hat L$ does not exist. We can now explain this more formally as follows: if such a code
were to exist, it would correspond to some distribution $\bar P$. Then we would have for all $x^n$,
$\bar L(x^n) = -\log \bar P(x^n)$.

* As explained in observation 4.2, we identify these codes with their length functions, which
is the only aspect we are interested in.

† The reason is that, in the full Bernoulli model with parameter $\theta \in [0, 1]$, the maximum
likelihood estimator is given by $n_{[1]}/n$, see example 4.7. Since the log-likelihood $\log P(x^n \mid \theta)$ is a
continuous function of $\theta$, this implies that if the frequency $n_{[1]}/n$ in $x^n$ is approximately (but not
precisely) $j/10$, then the ML estimator in the restricted model $\{0.1, \ldots, 0.9\}$ is still given by
$\hat\theta = j/10$. Then $\log P(x^n \mid \theta)$ is maximized by $\theta = j/10$, so that the $L \in \mathcal{L}$ that minimizes code
length corresponds to $\theta = j/10$.
But, by definition, for all $x^n \in \mathcal{X}^n$, $\bar L(x^n) \leq \hat L(x^n) = -\log P(x^n \mid \hat\theta(x^n))$ where
$\hat\theta(x^n) \in \{0.1, \ldots, 0.9\}$. Thus we get for all $x^n$, $-\log \bar P(x^n) \leq -\log P(x^n \mid \hat\theta(x^n))$
or $\bar P(x^n) \geq P(x^n \mid \hat\theta(x^n))$, so that, since $|\mathcal{L}| > 1$,

$$\sum_{x^n} \bar P(x^n) \geq \sum_{x^n} P(x^n \mid \hat\theta(x^n)) = \sum_{x^n} \max_{\theta} P(x^n \mid \theta) > 1, \qquad (4.10)$$

where the last inequality follows because for any two $\theta_1, \theta_2$ with $\theta_1 \neq \theta_2$, there is at
least one $x^n$ with $P(x^n \mid \theta_1) > P(x^n \mid \theta_2)$. Equation (4.10) says that $\bar P$ is not a
probability distribution. It follows that $\hat L$ cannot be a code length function. The argument
can be extended beyond the Bernoulli model of the example above: as long as $|\mathcal{L}| > 1$,
and all codes in $\mathcal{L}$ correspond to a non-defective distribution, (4.10) must still hold,
so that there exists no code $\bar L$ with $\bar L(x^n) = \hat L(x^n)$ for all $x^n$. The underlying reason
that no such code exists is the fact that probabilities must sum up to something $\leq 1$; or
equivalently, that there exists no coding scheme assigning short code words to many
different messages - see example 4.2. □
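
The inequality (4.10) can be checked numerically for small sample sizes. The sketch below is an illustration under stated assumptions: it uses the nine Bernoulli parameters of example 4.9, and the names `THETAS` and `sum_of_max_likelihoods` are introduced here only for the example. Because the likelihood depends on $x^n$ only through its number of 1s, the sum over all $2^n$ sequences can be grouped by that count.

```python
import math

THETAS = [j / 10 for j in range(1, 10)]   # the nine parameters 0.1, ..., 0.9

def sum_of_max_likelihoods(n):
    """Sum over all x^n in {0,1}^n of max_theta P(x^n | theta).

    P(x^n | theta) depends on x^n only through the number of 1s, m, so the
    sequences are grouped by m, with math.comb(n, m) sequences in each group.
    """
    total = 0.0
    for m in range(n + 1):
        best = max(t ** m * (1 - t) ** (n - m) for t in THETAS)
        total += math.comb(n, m) * best
    return total

totals = {n: sum_of_max_likelihoods(n) for n in (1, 2, 5, 10)}
```

For $n = 1$ the two sequences each have maximized likelihood 0.9, so the sum is 1.8; the sum stays above 1 for every $n$, which is why no code can have lengths $\hat L(x^n)$ for all $x^n$.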

Since there exists no code which, no matter what $x^n$ is, always mimics the best code for
$x^n$, it may make sense to look for the next best thing: does there exist a code which,
for all $x^n \in \mathcal{X}^n$, is ‘nearly’ (in some sense) as good as $\hat L(x^n)$? It turns out that in
many cases, the answer is yes: there typically exist codes $\bar L$ such that no matter what
$x^n$ arrives, $\bar L(x^n)$ is not much larger than $\hat L(x^n)$, which may be viewed as the code
that is best ‘with hindsight’ (i.e., after seeing $x^n$). Intuitively, codes which satisfy this
property are called universal codes - a more precise definition follows below. The first
(but perhaps not foremost) example of a universal code is the two-part code that we
have encountered in section 4.3.

4.4.1  Two-part  codes  as  simple  universal  codes 


Example 4.10 (Finite $\mathcal{L}$)

Let $\mathcal{L}$ be as in example 4.9. We can devise a code $\bar L_{2\text{-p}}$ for all $x^n \in \mathcal{X}^n$ as follows:
to encode $x^n$, we first encode the $j \in \{1, \ldots, 9\}$ such that $L_j(x^n) = \min_{L \in \mathcal{L}} L(x^n)$,
using a uniform code. This takes $\log 9$ bits. We then encode $x^n$ itself using the code
indexed by $j$. This takes $L_j(x^n)$ bits. Note that in contrast to the naive scheme discussed
in example 4.9, the resulting scheme properly defines a prefix code: a decoder can
decode $x^n$ by first decoding $j$, and then decoding $x^n$ using $L_j$. Thus, for every possible
$x^n \in \mathcal{X}^n$, we obtain

$$\bar L_{2\text{-p}}(x^n) = \min_{L \in \mathcal{L}} L(x^n) + \log 9.$$

For all $L \in \mathcal{L}$, $\min_{x^n} L(x^n)$ grows linearly in $n$:
$\min_{\theta, x^n} \{-\log P(x^n \mid \theta)\} = -n \log 0.9 \approx 0.15n$.
Unless $n$ is very small, no matter what $x^n$ arises, the extra number of bits we
need using $\bar L_{2\text{-p}}$ compared to $\hat L(x^n)$ is negligible. □
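
A minimal sketch of the two-part scheme of example 4.10, assuming base-2 logarithms and idealized (non-integer) code lengths; the helper names `code_length` and `two_part_encode` are hypothetical. It returns the index $j$ that would be sent first and the total number of bits.

```python
import math

THETAS = [j / 10 for j in range(1, 10)]   # parameters of the nine candidate codes

def code_length(x, theta):
    """L_j(x^n) = -log2 P(x^n | theta) for a Bernoulli code, in bits."""
    ones = sum(x)
    zeros = len(x) - ones
    return -(ones * math.log2(theta) + zeros * math.log2(1 - theta))

def two_part_encode(x):
    """Return (j, total_bits): the 1-based index of the best code and the
    two-part length log2(9) + L_j(x^n)."""
    lengths = [code_length(x, t) for t in THETAS]
    best = min(range(len(THETAS)), key=lengths.__getitem__)
    return best + 1, math.log2(len(THETAS)) + lengths[best]

# Hypothetical sample with frequency of 1s equal to 0.8.
x = [1] * 8 + [0] * 2
j, bits = two_part_encode(x)
```

Here the best code is the one with $\theta = 0.8$, so `j` is 8, and the overhead over the best single code is exactly $\log_2 9 \approx 3.17$ bits regardless of the length of `x`.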

More generally, let $\mathcal{L} = \{L_1, \ldots, L_M\}$ where $M$ can be arbitrarily large, and the $L_j$
can be any code length functions we like; they do not necessarily represent Bernoulli
distributions any more. By the reasoning of example 4.10, there exists a (two-part)
code such that for all $x^n \in \mathcal{X}^n$,

$$\bar L_{2\text{-p}}(x^n) = \min_{L \in \mathcal{L}} L(x^n) + \log M. \qquad (4.11)$$
In most applications $\min_{L \in \mathcal{L}} L(x^n)$ grows linearly in $n$, and we see from (4.11) that, as
soon as $n$ becomes substantially larger than $\log M$, the difference in performance between
our universal code and $\hat L(x^n)$ becomes negligible. In general, we do not always
want to use a uniform code for the elements in $\mathcal{L}$; note that any arbitrary code on $\mathcal{L}$
will give us an analogue of (4.11), but with a worst-case overhead larger than $\log M$ -
corresponding to the largest code length of any of the elements in $\mathcal{L}$.

Example 4.11 (Countably infinite $\mathcal{L}$)

We can also construct a 2-part code for arbitrary countably infinite sets of codes
$\mathcal{L} = \{L_1, L_2, \ldots\}$: we first encode some $k$ using our simple code for the integers (example
4.4). With this code we need $2 \log k + 1$ bits to encode integer $k$. We then encode $x^n$
using the code $L_k$. $\bar L_{2\text{-p}}$ is now defined as the code we get if, for any $x^n$, we encode
$x^n$ using the $L_k$ minimizing the total two-part description length $2 \log k + 1 + L_k(x^n)$.

In contrast to the case of finite $\mathcal{L}$, there does not exist a constant $c$ any more such that for
all $n$, $x^n \in \mathcal{X}^n$, $\bar L_{2\text{-p}}(x^n) \leq \inf_{L \in \mathcal{L}} L(x^n) + c$. Instead we have the following weaker,
but still remarkable property: for all $k$, all $n$, all $x^n$, $\bar L_{2\text{-p}}(x^n) \leq L_k(x^n) + 2 \log k + 1$.
Therefore, we also get

$$\bar L_{2\text{-p}}(x^n) \leq \inf_{L \in \{L_1, \ldots, L_k\}} L(x^n) + 2 \log k + 1.$$

For any $k$, as $n$ grows larger, the code $\bar L_{2\text{-p}}$ starts to mimic whatever $L \in \{L_1, \ldots, L_k\}$
compresses the data most. However, the larger $k$, the larger $n$ has to be before this
happens. □


4.4.2  From  universal  codes  to  universal  models 

Instead of postulating a set of candidate codes $\mathcal{L}$, we may equivalently postulate a set
$\mathcal{M}$ of candidate probabilistic sources, such that $\mathcal{L}$ is the set of codes corresponding to
$\mathcal{M}$. We already implicitly did this in example 4.9.

The reasoning is now as follows: we think that one of the $P \in \mathcal{M}$ will assign a high
likelihood to the data to be observed. Therefore we would like to design a code that, for
all $x^n$ we might observe, performs essentially as well as the code corresponding to the
best-fitting, maximum likelihood (minimum code length) $P \in \mathcal{M}$ for $x^n$. Similarly,
we can think of universal codes such as the two-part code in terms of the (possibly
defective, see section 4.1 and observation 4.1) distributions corresponding to them. Such
distributions corresponding to universal codes are called universal models. The use of
mapping universal codes back to distributions is illustrated by the Bayesian universal
model which we now introduce.

Universal  model:  twice  misleading  terminology 

The words ‘universal’ and ‘model’ are somewhat of a misnomer: first,
these codes/models are only ‘universal’ relative to a restricted ‘universe’
$\mathcal{M}$. Second, the use of the word ‘model’ will be very confusing
to statisticians, who (as we also do in this chapter) call a family of
distributions such as $\mathcal{M}$ a ‘model’. But the phrase originates from
information theory, where a ‘model’ often refers to a single distribution
rather than a family. Thus, a ‘universal model’ is a single distribution,
representing a statistical ‘model’ $\mathcal{M}$.
Example 4.12 (Bayesian universal model)

Let $\mathcal{M}$ be a finite or countable set of probabilistic sources, parameterized by some
parameter set $\Theta$. Let $W$ be a distribution on $\Theta$. Adopting terminology from Bayesian
statistics, $W$ is usually called a prior distribution. We can construct a new probabilistic
source $\bar P_{\text{Bayes}}$ by taking a weighted (according to $W$) average or mixture over the
distributions in $\mathcal{M}$. That is, we define for all $n$, $x^n \in \mathcal{X}^n$,

$$\bar P_{\text{Bayes}}(x^n) := \sum_{\theta \in \Theta} P(x^n \mid \theta) W(\theta). \qquad (4.12)$$

It is easy to check that $\bar P_{\text{Bayes}}$ is a probabilistic source according to our definition.
In case $\Theta$ is continuous, the sum gets replaced by an integral but otherwise nothing
changes in the definition. In Bayesian statistics, $\bar P_{\text{Bayes}}$ is called the Bayesian marginal
likelihood or Bayesian mixture [10]. To see that $\bar P_{\text{Bayes}}$ is a universal model, note that
for all $\theta_0 \in \Theta$,

$$-\log \bar P_{\text{Bayes}}(x^n) = -\log \sum_{\theta \in \Theta} P(x^n \mid \theta) W(\theta) \leq -\log P(x^n \mid \theta_0) + c_{\theta_0}, \qquad (4.13)$$

where the inequality follows because a sum is at least as large as each of its terms,
and $c_{\theta_0} = -\log W(\theta_0)$ depends on $\theta_0$ but not on $n$. Thus, $\bar P_{\text{Bayes}}$ is a universal model
or equivalently, the code with lengths $-\log \bar P_{\text{Bayes}}$ is a universal code. Note that the
derivation in (4.13) only works if $\Theta$ is finite or countable; the case of continuous $\Theta$ is
treated in section 4.5. □

Bayes is better than two-part The Bayesian model is in a sense superior to the two-part
code. Namely, in the two-part code we first encode an element of $\mathcal{M}$ or its parameter
set $\Theta$ using some code $L_0$. Such a code must correspond to some ‘prior’ distribution
$W$ on $\mathcal{M}$ so that the two-part code gives code lengths

$$\bar L_{2\text{-p}}(x^n) = \min_{\theta \in \Theta} \{-\log P(x^n \mid \theta) - \log W(\theta)\}, \qquad (4.14)$$

where $W$ depends on the specific code $L_0$ that was used. Using the Bayes code with
prior $W$, we get as in (4.13),

$$-\log \bar P_{\text{Bayes}}(x^n) = -\log \sum_{\theta \in \Theta} P(x^n \mid \theta) W(\theta) \leq \min_{\theta \in \Theta} \{-\log P(x^n \mid \theta) - \log W(\theta)\}.$$

The inequality becomes strict whenever $P(x^n \mid \theta) > 0$ for more than one value of $\theta$.
Comparing to (4.14), we see that in general the Bayesian code is preferable over the
two-part code: for all $x^n$ it never assigns code lengths larger than $\bar L_{2\text{-p}}(x^n)$, and in
many cases it assigns strictly shorter code lengths for some $x^n$. But this raises two
important issues: (1) what exactly do we mean by ‘better’ anyway? (2) can we say that
‘some prior distributions are better than others’? These questions are answered below.
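
The comparison between the Bayesian mixture code and the two-part code can be made concrete with a small numerical sketch. The setting is an assumption chosen for illustration: the nine Bernoulli parameters of example 4.9 with a uniform prior $W$, base-2 logarithms, and hypothetical function names.

```python
import math

THETAS = [j / 10 for j in range(1, 10)]
PRIOR = [1 / len(THETAS)] * len(THETAS)   # assumption: uniform prior W

def likelihood(x, theta):
    """P(x^n | theta) under the Bernoulli model."""
    ones = sum(x)
    return theta ** ones * (1 - theta) ** (len(x) - ones)

def bayes_length(x):
    """-log2 of the W-weighted mixture of (4.12), in bits."""
    return -math.log2(sum(w * likelihood(x, t) for t, w in zip(THETAS, PRIOR)))

def two_part_length(x):
    """min over theta of -log2 P(x^n | theta) - log2 W(theta), as in (4.14)."""
    return min(-math.log2(likelihood(x, t)) - math.log2(w)
               for t, w in zip(THETAS, PRIOR))

# Hypothetical sample, chosen only for illustration.
x = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
gap = two_part_length(x) - bayes_length(x)
```

Since every one of the nine distributions assigns positive probability to every sequence, `gap` is strictly positive for any `x`: the mixture collects probability from all parameters, whereas the two-part code uses only the single best one.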

4.4.3 NML as an optimal universal model

We  can  measure  the  performance  of  universal  models  relative  to  a  set  of  candidate 
sources  M  using  the  regret: 
Definition 4.1 (Regret)

Let $\mathcal{M}$ be a class of probabilistic sources. Let $\bar P$ be a probability distribution on $\mathcal{X}^n$
($\bar P$ is not necessarily in $\mathcal{M}$). For given $x^n$, the regret, $\mathcal{R}$, of $\bar P$ relative to $\mathcal{M}$ is defined
as

$$\mathcal{R} := -\log \bar P(x^n) - \min_{P \in \mathcal{M}} \{-\log P(x^n)\}. \qquad (4.15)$$

The regret of $\bar P$ relative to $\mathcal{M}$ for $x^n$ is the additional number of bits needed to encode
$x^n$ using the code/distribution $\bar P$, as compared to the number of bits that would have been
needed if we had used the code/distribution in $\mathcal{M}$ that was optimal (‘best-fitting’) with
hindsight. For simplicity, from now on we tacitly assume that for all the models $\mathcal{M}$
we work with, there is a single $\hat\theta(x^n)$ maximizing the likelihood for every $x^n \in \mathcal{X}^n$.
In that case (4.15) simplifies to

$$\mathcal{R} = -\log \bar P(x^n) - \{-\log P(x^n \mid \hat\theta(x^n))\}.$$

We would like to measure the quality of a universal model $\bar{P}$ in terms of its regret.
However, $\bar{P}$ may have small (even $< 0$) regret for some $x^n$, and very large regret
for other $x^n$. We must somehow find a measure of quality that takes into account all
$x^n \in \mathcal{X}^n$. We take a worst-case approach, and look for universal models $\bar{P}$ with small
worst-case regret, where the worst case is over all sequences. Formally, the maximum
or worst-case regret of $\bar{P}$ relative to $\mathcal{M}$ is defined as

$$\mathcal{R}_{\max}(\bar{P}) := \max_{x^n \in \mathcal{X}^n} \{ -\log \bar{P}(x^n) - \{ -\log P(x^n \mid \hat\theta(x^n)) \} \}.$$

If we use $\mathcal{R}_{\max}$ as our quality measure, then the ‘optimal’ universal model relative to
$\mathcal{M}$, for given sample size $n$, is the distribution achieving

$$\min_{\bar{P}} \mathcal{R}_{\max}(\bar{P}) = \min_{\bar{P}} \max_{x^n \in \mathcal{X}^n} \{ -\log \bar{P}(x^n) - \{ -\log P(x^n \mid \hat\theta(x^n)) \} \}, \tag{4.16}$$

where the minimum is over all defective distributions on $\mathcal{X}^n$. The $\bar{P}$ minimizing (4.16)
corresponds to the code minimizing the additional number of bits compared to the code in
$\mathcal{M}$ that is best in hindsight, in the worst case over all possible $x^n$. It turns out that we
can solve for $\bar{P}$ in (4.16). To this end, we first define the complexity of a given model
$\mathcal{M}$ as

$$\mathrm{COMP}_n(\mathcal{M}) := \log \sum_{x^n \in \mathcal{X}^n} P(x^n \mid \hat\theta(x^n)). \tag{4.17}$$

This quantity plays a fundamental role in refined MDL, section 4.6. To get a first idea
of why $\mathrm{COMP}_n$ is called model complexity, note that the more sequences $x^n$ with
large $P(x^n \mid \hat\theta(x^n))$, the larger $\mathrm{COMP}_n(\mathcal{M})$. In other words, the more sequences that
can be fit well by an element of $\mathcal{M}$, the larger $\mathcal{M}$’s complexity.

Proposition 4.1 (Shtarkov [104])

Suppose that $\mathrm{COMP}_n(\mathcal{M})$ is finite. Then the minimax regret (4.16) is uniquely achieved
for the distribution $\bar{P}_{\mathrm{nml}}$ given by

$$\bar{P}_{\mathrm{nml}}(x^n) := \frac{P(x^n \mid \hat\theta(x^n))}{\sum_{y^n \in \mathcal{X}^n} P(y^n \mid \hat\theta(y^n))}. \tag{4.18}$$

The distribution $\bar{P}_{\mathrm{nml}}$ is known as the Shtarkov distribution or the normalized
maximum likelihood (NML) distribution.

Proof of proposition 4.1: Plug in $\bar{P}_{\mathrm{nml}}$ in (4.16) and notice that for all $x^n \in \mathcal{X}^n$,

$$-\log \bar{P}_{\mathrm{nml}}(x^n) - \{ -\log P(x^n \mid \hat\theta(x^n)) \} = \mathcal{R}_{\max}(\bar{P}_{\mathrm{nml}}) = \mathrm{COMP}_n(\mathcal{M}), \tag{4.19}$$

so that $\bar{P}_{\mathrm{nml}}$ achieves the same regret, equal to $\mathrm{COMP}_n(\mathcal{M})$, no matter what $x^n$
actually obtains. Since every distribution $\bar{P}$ on $\mathcal{X}^n$ with $\bar{P} \ne \bar{P}_{\mathrm{nml}}$ must satisfy
$\bar{P}(z^n) < \bar{P}_{\mathrm{nml}}(z^n)$ for at least one $z^n \in \mathcal{X}^n$, it follows that

$$\mathcal{R}_{\max}(\bar{P}) \ge -\log \bar{P}(z^n) + \log P(z^n \mid \hat\theta(z^n)) > -\log \bar{P}_{\mathrm{nml}}(z^n) + \log P(z^n \mid \hat\theta(z^n)) = \mathcal{R}_{\max}(\bar{P}_{\mathrm{nml}}). \qquad \Box$$

$\bar{P}_{\mathrm{nml}}$ is quite literally a ‘normalized maximum likelihood’ distribution: it tries to assign
to each $x^n$ the probability of $x^n$ according to the ML distribution for $x^n$. By (4.10),
this is not possible: the resulting ‘probabilities’ add to something larger than 1. But
we can normalize these ‘probabilities’ by dividing by their sum
$\sum_{y^n \in \mathcal{X}^n} P(y^n \mid \hat\theta(y^n))$, and
then we obtain a probability distribution on $\mathcal{X}^n$ after all.

Whenever $\mathcal{X}$ is finite, the sum $\mathrm{COMP}_n(\mathcal{M})$ is finite so that the NML distribution is
well-defined. If $\mathcal{X}$ is countably infinite or continuous-valued, the sum $\mathrm{COMP}_n(\mathcal{M})$
may be infinite and then the NML distribution may be undefined. In that case, there
exists no universal model achieving constant regret as in (4.19). If $\mathcal{M}$ is parametric,
then $\bar{P}_{\mathrm{nml}}$ is typically well-defined as long as we suitably restrict the parameter space.
The parametric case forms the basis of ‘refined MDL’ and will be discussed at length
in the next section.
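
For a small binary alphabet the NML distribution can be computed by brute force, which makes the constant-regret property (4.19) easy to verify. The following is a minimal sketch, assuming the Bernoulli model with ML estimator equal to the fraction of ones; the function names and the choice n = 6 are illustrative assumptions, not part of the report.

```python
import math
from itertools import product

def max_likelihood(xs):
    """P(x^n | theta_hat(x^n)) for the Bernoulli model, theta_hat = (number of ones) / n."""
    n, n1 = len(xs), sum(xs)
    theta_hat = n1 / n
    # Python evaluates 0.0 ** 0 as 1.0, which is the convention needed at the boundary.
    return theta_hat ** n1 * (1 - theta_hat) ** (n - n1)

def comp_bits(n):
    """COMP_n as in (4.17), in bits, by enumerating all 2^n binary sequences."""
    return math.log2(sum(max_likelihood(xs) for xs in product((0, 1), repeat=n)))

def nml_probability(xs):
    """Normalized maximum likelihood probability of xs, as in (4.18)."""
    return max_likelihood(xs) / 2 ** comp_bits(len(xs))

def regret_bits(xs):
    """Regret of NML on xs: -log P_nml(xs) + log P(xs | theta_hat(xs))."""
    return -math.log2(nml_probability(xs)) + math.log2(max_likelihood(xs))

n = 6
regrets = [regret_bits(xs) for xs in product((0, 1), repeat=n)]
print(f"COMP_{n} = {comp_bits(n):.4f} bits")
print(f"regret range: {min(regrets):.4f} .. {max(regrets):.4f} bits")
```

Up to floating-point rounding, the minimum and maximum regret coincide with COMP_n, as (4.19) states. The enumeration is exponential in n, so this approach is only meant for very small samples.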

Summary: Universal codes and models

Let $\mathcal{M}$ be a family of probabilistic sources. A universal model in an individual
sequence sense* relative to $\mathcal{M}$, in this text simply called a ‘universal model for $\mathcal{M}$’, is a
sequence of distributions $\bar{P}^{(1)}, \bar{P}^{(2)}, \ldots$ on $\mathcal{X}^1, \mathcal{X}^2, \ldots$ respectively, such that for all
$P \in \mathcal{M}$ and $\epsilon > 0$,

$$\max_{x^n \in \mathcal{X}^n} \frac{1}{n} \{ -\log \bar{P}^{(n)}(x^n) - (-\log P(x^n)) \} \le \epsilon \quad \text{as } n \to \infty.$$

Multiplying both sides with $n$ we see that $\bar{P}$ is universal if for every $P \in \mathcal{M}$, the code
length difference $-\log \bar{P}(x^n) + \log P(x^n)$ increases sublinearly in $n$. If $\mathcal{M}$ is finite,
then the two-part, Bayes and NML distributions are universal in a very strong sense:
rather than just increasing sublinearly, the code length difference is bounded by a
constant. We already discussed two-part, Bayesian and minimax optimal (NML) universal
models, but there are several other types. We mention prequential universal models
(section 4.5.4), the Kolmogorov universal model, conditionalized two-part codes [97]
and Cesàro-average codes [9].


* What we call ‘universal model’ in this text is known in the literature as a ‘universal model
in the individual sequence sense’ - there also exist universal models in an ‘expected sense’, see
section 4.8.1. These lead to slightly different versions of MDL.

4.5 Simple refined MDL and its four interpretations

In section 4.3, we indicated that ‘crude’ MDL needs to be refined. In section 4.4 we
introduced universal models. We now show how universal models, in particular the
minimax optimal universal model $\bar{P}_{\mathrm{nml}}$, can be used to define a refined version of
MDL model selection. Here we only discuss the simplest case: suppose we are given
data $D = (x_1, \ldots, x_n)$ and two models $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$ such that $\mathrm{COMP}_n(\mathcal{M}^{(1)})$
and $\mathrm{COMP}_n(\mathcal{M}^{(2)})$ (4.17) are both finite. For example, we could have some binary
data, with $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$ the first- and second-order Markov models (example
4.7), both considered possible explanations for the data. We show how to deal with an
infinite number of models and/or models with infinite $\mathrm{COMP}_n$ in section 4.6.

Denote by $\bar{P}_{\mathrm{nml}}(\cdot \mid \mathcal{M}^{(j)})$ the NML distribution on $\mathcal{X}^n$ corresponding to model
$\mathcal{M}^{(j)}$. Refined MDL tells us to pick the model $\mathcal{M}^{(j)}$ maximizing the normalized maximum
likelihood $\bar{P}_{\mathrm{nml}}(D \mid \mathcal{M}^{(j)})$, or, by (4.18), equivalently, minimizing

$$-\log \bar{P}_{\mathrm{nml}}(D \mid \mathcal{M}^{(j)}) = -\log P(D \mid \hat\theta^{(j)}(D)) + \mathrm{COMP}_n(\mathcal{M}^{(j)}). \tag{4.20}$$

From a coding theoretic point of view, we associate with each $\mathcal{M}^{(j)}$ a code with lengths
$-\log \bar{P}_{\mathrm{nml}}(\cdot \mid \mathcal{M}^{(j)})$, and we pick the model minimizing the code length of the data. The code
length $-\log \bar{P}_{\mathrm{nml}}(D \mid \mathcal{M}^{(j)})$ has been called the stochastic complexity of the data $D$
relative to model $\mathcal{M}^{(j)}$ [92], whereas $\mathrm{COMP}_n(\mathcal{M}^{(j)})$ is called the parametric
complexity or model cost of $\mathcal{M}^{(j)}$ (in this chapter we simply call it ‘complexity’). We have
already indicated in the previous section that $\mathrm{COMP}_n(\mathcal{M}^{(j)})$ measures something like
the ‘complexity’ of model $\mathcal{M}^{(j)}$. On the other hand, $-\log P(D \mid \hat\theta^{(j)}(D))$ is minus the
maximized log-likelihood of the data, so it measures something like (minus) fit or
error - in the linear regression case, it can be directly related to the mean squared error,
section 4.8. Thus, (4.20) embodies a trade-off between lack of fit (measured by minus
log-likelihood) and complexity (measured by $\mathrm{COMP}_n(\mathcal{M}^{(j)})$). The confidence in the
decision is given by the code length difference

$$\left| -\log \bar{P}_{\mathrm{nml}}(D \mid \mathcal{M}^{(1)}) - \left( -\log \bar{P}_{\mathrm{nml}}(D \mid \mathcal{M}^{(2)}) \right) \right|.$$

In general, $-\log \bar{P}_{\mathrm{nml}}(D \mid \mathcal{M})$ can only be evaluated numerically - the only
exception is when $\mathcal{M}$ is the Gaussian family, example 4.18. In many cases even numerical
evaluation is computationally problematic. But the re-interpretations of $\bar{P}_{\mathrm{nml}}$ we
provide below also indicate that in many cases, $-\log \bar{P}_{\mathrm{nml}}(D \mid \mathcal{M})$ is relatively easy to
approximate.
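
The selection rule (4.20) can be sketched for two deliberately simple models with finite complexity. The pair used here is an assumption made for illustration (it is not the Markov example from the text): model 1 contains only the fair coin, so its complexity is 0 bits, and model 2 is the full Bernoulli model, whose complexity is computed exactly by grouping sequences by their number of ones.

```python
import math

def bernoulli_comp_bits(n):
    """COMP_n of the Bernoulli model in bits, summing over the number of ones k."""
    total = sum(math.comb(n, k) * (k / n) ** k * (1 - k / n) ** (n - k)
                for k in range(n + 1))
    return math.log2(total)

def code_length_bernoulli(xs):
    """Right-hand side of (4.20): minus maximized log-likelihood plus COMP_n."""
    n, n1 = len(xs), sum(xs)
    theta_hat = n1 / n
    max_lik = theta_hat ** n1 * (1 - theta_hat) ** (n - n1)
    return -math.log2(max_lik) + bernoulli_comp_bits(n)

def code_length_fair_coin(xs):
    """A one-element model has COMP_n = 0, so the code length is n bits."""
    return float(len(xs))

def select_model(xs):
    """Return (index of selected model, code length difference in bits)."""
    l1, l2 = code_length_fair_coin(xs), code_length_bernoulli(xs)
    return (1 if l1 <= l2 else 2), abs(l1 - l2)

print(select_model([0, 1] * 10))           # balanced data
print(select_model([1] * 18 + [0] * 2))    # strongly skewed data
```

On balanced data the extra fit of the Bernoulli model does not pay for its complexity term and the fair coin is selected; on strongly skewed data the gain in fit outweighs the complexity and the Bernoulli model is selected. The second element of the returned pair is the code length difference that the text interprets as confidence in the decision.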

Example 4.13 (Refined MDL and GLRT)

Generalized likelihood ratio testing [15] tells us to pick the $\mathcal{M}^{(j)}$ maximizing
$\log P(D \mid \hat\theta^{(j)}(D)) + c$, where $c$ is determined by the desired type-I and type-II errors.
In practice one often applies a naive variation*, simply picking the model $\mathcal{M}^{(j)}$
maximizing $\log P(D \mid \hat\theta^{(j)}(D))$.
This amounts to ignoring the complexity terms $\mathrm{COMP}_n(\mathcal{M}^{(j)})$ in (4.20): MDL tries
to avoid overfitting by picking the model maximizing the normalized rather than the
ordinary likelihood. The more distributions in $\mathcal{M}$ that fit the data well, the larger the
normalization term. ◊

* To be fair, we should add that this naive version of GLRT is introduced here for educational
purposes only. It is not recommended by any serious statistician!

The hope is that the normalization term $\mathrm{COMP}_n(\mathcal{M}^{(j)})$ strikes the right balance
between complexity and fit. Whether it really does this depends on whether $\mathrm{COMP}_n$ is
a ‘good’ measure of complexity. In the remainder of this section we shall argue that
it is, by giving four different interpretations of $\mathrm{COMP}_n$ and of the resulting trade-off
(4.20):

1. Compression interpretation.

2. Counting interpretation.

3. Bayesian interpretation.

4. Prequential (predictive) interpretation.

4.5.1  Compression  interpretation 

Rissanen’s original goal was to select the model that detects the most regularity in
the data; he identified this with the ‘model that allows for the most compression of
data $x^n$’. To make this precise, a code is associated with each model. The NML code
with lengths $-\log \bar{P}_{\mathrm{nml}}(\cdot \mid \mathcal{M}^{(j)})$ seems to be a very reasonable choice for such a code
because of the following two properties:

1. The better the best-fitting distribution in $\mathcal{M}^{(j)}$ fits the data, the shorter the code
length $-\log \bar{P}_{\mathrm{nml}}(D \mid \mathcal{M}^{(j)})$.

2. No distribution in $\mathcal{M}^{(j)}$ is given a prior preference over any other distribution,
since the regret of $\bar{P}_{\mathrm{nml}}(\cdot \mid \mathcal{M}^{(j)})$ is the same for all $D \in \mathcal{X}^n$ (4.19). $\bar{P}_{\mathrm{nml}}$ is
the only complete prefix code with this property, which may be restated as:
$\bar{P}_{\mathrm{nml}}$ treats all distributions within each $\mathcal{M}^{(j)}$ on the same footing!

Therefore, if one is willing to accept the basic ideas underlying MDL as first principles,
then the use of NML in model selection is now justified to some extent. Below we give
additional justifications that are not directly based on data compression; but we first
provide some further interpretation of $-\log \bar{P}_{\mathrm{nml}}$.

Compression and separating structure from noise  We present the following ideas
in an imprecise fashion - Rissanen and Tabus [99] recently showed how to make them
precise. The stochastic complexity of data $D$ relative to $\mathcal{M}$, given by (4.20), can be
interpreted as the amount of information in the data relative to $\mathcal{M}$, measured in bits.
Although a one-part code length, it still consists of two terms: a term $\mathrm{COMP}_n(\mathcal{M})$
measuring the amount of structure or meaningful information in the data (as ‘seen
through $\mathcal{M}$’), and a term $-\log P(D \mid \hat\theta(D))$ measuring the amount of noise or
accidental information in the data. To see that this second term measures noise, consider
the regression example, example 3.2, again. As will be seen in section 4.8, in that case
$-\log P(D \mid \hat\theta(D))$ becomes equal to a linear function of the mean squared error of the
best-fitting polynomial in the set of $k$-th degree polynomials. To see that the first term
measures structure, we reinterpret it below as the number of bits needed to specify
a ‘distinguishable’ distribution in $\mathcal{M}$, using a uniform code on all ‘distinguishable’
distributions.


4.5.2  Counting  interpretation 

The parametric complexity can be interpreted as measuring (the log of) the number of
distinguishable distributions in the model. Intuitively, the more distributions a model
contains, the more patterns it can fit well, so the larger the risk of overfitting. However,
if two distributions are very ‘close’ in the sense that they assign high likelihood to the
same patterns, they do not contribute so much to the complexity of the overall model.
It seems that we should measure complexity of a model in terms of the number of
distributions it contains that are ‘essentially different’ (distinguishable), and we now
show that $\mathrm{COMP}_n$ measures something like this. Consider a finite model $\mathcal{M}$ with
parameter set $\Theta = \{\theta_1, \ldots, \theta_M\}$. Note that

$$\sum_{x^n \in \mathcal{X}^n} P(x^n \mid \hat\theta(x^n)) = \sum_{j=1}^{M} \sum_{x^n : \hat\theta(x^n) = \theta_j} P(x^n \mid \theta_j) = \sum_{j=1}^{M} \Big( 1 - \sum_{x^n : \hat\theta(x^n) \ne \theta_j} P(x^n \mid \theta_j) \Big) = M - \sum_{j=1}^{M} P(\hat\theta(x^n) \ne \theta_j \mid \theta_j).$$

We may think of $P(\hat\theta(x^n) \ne \theta_j \mid \theta_j)$ as the probability, according to $\theta_j$, that the data
look as if they come from some $\theta \ne \theta_j$. Thus, it is the probability that $\theta_j$ is mistaken
for another distribution in $\Theta$. Therefore, for finite $\mathcal{M}$, Rissanen’s model complexity is
the logarithm of the number of distributions minus the summed probability that some
$\theta_j$ is ‘mistaken’ for some $\theta \ne \theta_j$. Now suppose $\mathcal{M}$ is i.i.d. By the law of large numbers
[30], we immediately see that the ‘sum of mistake probabilities’ $\sum_j P(\hat\theta(x^n) \ne \theta_j \mid \theta_j)$
tends to 0 as $n$ grows. It follows that for large $n$, the model complexity converges to
$\log M$. For large $n$, the distributions in $\mathcal{M}$ are ‘perfectly distinguishable’ (the
probability that a sample coming from one is more representative of another is negligible), and
then the parametric complexity $\mathrm{COMP}_n(\mathcal{M})$ of $\mathcal{M}$ is simply the log of the number of
distributions in $\mathcal{M}$.
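
The convergence of the complexity of a finite model to the log of its number of distributions can be observed directly. The sketch below assumes a hypothetical finite i.i.d. Bernoulli model with three parameter values; since all sequences with the same number of ones have the same likelihood, the sum over sequences is computed per count.

```python
import math

def finite_comp_bits(thetas, n):
    """COMP_n in bits for a finite Bernoulli model: log2 of the sum, over all x^n,
    of the largest likelihood any theta in thetas assigns to x^n."""
    total = 0.0
    for k in range(n + 1):  # k = number of ones in the sequence
        best = max(t ** k * (1 - t) ** (n - k) for t in thetas)
        total += math.comb(n, k) * best
    return math.log2(total)

thetas = [0.2, 0.5, 0.8]  # hypothetical parameter set with M = 3 elements
for n in (1, 5, 20, 200):
    print(f"n = {n:3d}: COMP_n = {finite_comp_bits(thetas, n):.4f} bits")
print(f"log2 M = {math.log2(len(thetas)):.4f} bits")
```

For small n the three distributions are easily mistaken for one another and the complexity stays well below log M; as n grows the mistake probabilities vanish and the complexity approaches log M from below.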

Example 4.14 (NML vs. two-part codes)

Incidentally, this shows that for finite i.i.d. $\mathcal{M}$, the two-part code with uniform prior
$W$ on $\mathcal{M}$ is asymptotically minimax optimal: for all $n$, the regret of the two-part code is
$\log M$ (4.11), whereas we just showed that for $n \to \infty$, $\mathcal{R}(\bar{P}_{\mathrm{nml}}) = \mathrm{COMP}_n(\mathcal{M}) \to \log M$.
However, for small $n$, some distributions in $\mathcal{M}$ may be mistaken for one
another; the number of distinguishable distributions in $\mathcal{M}$ is then smaller than the actual
number of distributions, and this is reflected in $\mathrm{COMP}_n(\mathcal{M})$ being (sometimes much)
smaller than $\log M$. ◊


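Asymptotic expansion of $\bar{P}_{\mathrm{nml}}$ and $\mathrm{COMP}_n$  Let $\mathcal{M}$ be a $k$-dimensional parametric
model. Under regularity conditions on $\mathcal{M}$ and the parameterization $\Theta \to \mathcal{M}$, to be
detailed below, we obtain the following asymptotic expansion ($n \to \infty$) of $\mathrm{COMP}_n$
[95], [109], [110], [111]:

$$\mathrm{COMP}_n(\mathcal{M}) = \frac{k}{2} \log \frac{n}{2\pi} + \log \int_{\theta \in \Theta} \sqrt{|I(\theta)|}\, d\theta + o(1). \tag{4.21}$$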

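Here $k$ is the number of parameters (degrees of freedom) in model $\mathcal{M}$, $n$ is the sample
size, and $o(1) \to 0$ as $n \to \infty$. $|I(\theta)|$ is the determinant of the $k \times k$ Fisher information
matrix† $I$ evaluated at $\theta$. In case $\mathcal{M}$ is an i.i.d. model, $I$ is given by

$$I_{ij}(\theta^*) := E_{\theta^*} \left\{ -\frac{\partial^2}{\partial \theta_i\, \partial \theta_j} \ln P(X \mid \theta) \right\}_{\theta = \theta^*}.$$

This is generalized to non-i.i.d. models as follows:

$$I_{ij}(\theta^*) := \lim_{n \to \infty} \frac{1}{n} E_{\theta^*} \left\{ -\frac{\partial^2}{\partial \theta_i\, \partial \theta_j} \ln P(X^n \mid \theta) \right\}_{\theta = \theta^*}.$$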

Equation (4.21) only holds if the model $\mathcal{M}$, its parameterization $\Theta$ and the sequence
$x_1, x_2, \ldots$ all satisfy certain conditions. Specifically, we require:

1. $\mathrm{COMP}_n(\mathcal{M}) < \infty$ and $\int \sqrt{|I(\theta)|}\, d\theta < \infty$.

2. $\hat\theta(x^n)$ does not come arbitrarily close to the boundary of $\Theta$: for some $\epsilon > 0$,
for all large $n$, $\hat\theta(x^n)$ remains farther than $\epsilon$ away from the boundary of $\Theta$.

3. $\mathcal{M}$ and $\Theta$ satisfy certain further conditions. A simple sufficient condition is
that $\mathcal{M}$ be an exponential family [15]. Roughly, this is a family that can be
parameterized so that for all $x$, $P(x \mid \beta) = \exp(\beta t(x)) f(x) g(\beta)$, where
$t : \mathcal{X} \to \mathbb{R}$ is a function of $X$. The Bernoulli model is an exponential family,
as can be seen by setting $\beta = \ln \theta - \ln(1 - \theta)$ and $t(x) = x$. Also the
multinomial, Gaussian, Poisson, Gamma, exponential, Zipf and many other
models are exponential families; but, for example, mixture models are not.

More general conditions are given by Takeuchi and Barron [109], [110], [111].
Essentially, if $\mathcal{M}$ behaves ‘asymptotically’ like an exponential family, then (4.21) still holds.
For example, (4.21) holds for the Markov models and for AR and ARMA processes.

Example 4.15 (Complexity of Bernoulli model)

The Bernoulli model $\mathcal{B}^{(0)}$ can be parameterized in a 1-1 way by the unit interval
(example 4.7). Thus, $k = 1$. An easy calculation shows that the Fisher information is
given by $I(\theta) = 1/(\theta(1 - \theta))$. Plugging this into (4.21) and calculating
$\int_0^1 \sqrt{1/(\theta(1 - \theta))}\, d\theta = \pi$ gives, with logarithms to base 2,

$$\mathrm{COMP}_n(\mathcal{B}^{(0)}) = \frac{1}{2} \log n + \frac{1}{2} \log \frac{\pi}{2} + o(1) = \frac{1}{2} \log n + 0.325748065 + o(1).$$

Computing the integral of the Fisher determinant is not easy in general. Hanson and
Fu [53] compute it for several practically relevant models. ◊
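
The quality of the asymptotic expansion (4.21) for the Bernoulli model can be checked against the exact complexity. The sketch below takes the Fisher information of the Bernoulli model to be 1/(θ(1−θ)), so that the integral of its square root over the unit interval equals π; the function names are illustrative, and logarithms are to base 2.

```python
import math

def exact_comp_bits(n):
    """Exact COMP_n of the Bernoulli model in bits, summing over the number of ones k."""
    total = sum(math.comb(n, k) * (k / n) ** k * (1 - k / n) ** (n - k)
                for k in range(n + 1))
    return math.log2(total)

def asymptotic_comp_bits(n):
    """(k/2) log2(n / (2 pi)) + log2 of the Fisher integral, with k = 1 and integral pi."""
    return 0.5 * math.log2(n / (2 * math.pi)) + math.log2(math.pi)

for n in (10, 100, 500):
    exact, approx = exact_comp_bits(n), asymptotic_comp_bits(n)
    print(f"n = {n:3d}: exact = {exact:.4f}, asymptotic = {approx:.4f}, "
          f"difference = {exact - approx:.4f} bits")
```

The difference between the two values shrinks as n grows, which is the behaviour expected of the o(1) term in (4.21).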

Whereas for finite $\mathcal{M}$, $\mathrm{COMP}_n(\mathcal{M})$ remains finite, for parametric models it generally
grows logarithmically in $n$. Since typically $-\log P(x^n \mid \hat\theta(x^n))$ grows linearly in $n$, it
is still the case that for fixed dimensionality $k$ (i.e. for a fixed $\mathcal{M}$ that is $k$-dimensional)
and large $n$, the part of the code length $-\log \bar{P}_{\mathrm{nml}}(x^n \mid \mathcal{M})$ due to the complexity of
$\mathcal{M}$ is very small compared to the part needed to encode data $x^n$ with $\hat\theta(x^n)$. The term
$\int_\Theta \sqrt{|I(\theta)|}\, d\theta$ may be interpreted as the contribution of the functional form of $\mathcal{M}$ to
the model complexity [5]. It does not grow with $n$ so that, when selecting between two
models, it becomes irrelevant and can be ignored for very large $n$. But for small $n$, it
can be important, as can be seen from example 3.4, Fechner’s and Stevens’ model. Both
models have two parameters, yet the $\int_\Theta \sqrt{|I(\theta)|}\, d\theta$-term is much larger for Stevens’
than for Fechner’s model. In the experiments in [80], the parameter set was restricted to
0 < a < 1, 0 < b < 3 for Stevens’ model and 0 < a < 1, 0 < b < 1 for Fechner’s
model. The variance of the error $Z$ was set to 1 in both models. With these values,
the difference in $\int_\Theta \sqrt{|I(\theta)|}\, d\theta$ is 3.804, which is non-negligible for small samples.
Thus, Stevens’ model contains more distinguishable distributions than Fechner’s, and
is better able to capture random noise in the data - as Townsend [113] already
speculated almost 30 years ago. Experiments suggest that for regression models such as
Stevens’ and Fechner’s, as well as for Markov models and general exponential
families, the approximation (4.21) is reasonably accurate already for small samples. But
this is certainly not true for general models:

The asymptotic expansion of $\mathrm{COMP}_n$ should be used with care!

Equation (4.21) does not hold for all parametric models; and for some models for
which it does hold, the $o(1)$ term may converge to 0 only for quite large sample
sizes. In [32], [34] it is shown that the approximation (4.21) is, in general, only valid
if $k$ is much smaller than $n$.

† The standard definition of Fisher information [63] is in terms of first derivatives of the
log-likelihood; for most parametric models of interest, the present definition coincides with the
standard one.

Two-part codes and $\mathrm{COMP}_n(\mathcal{M})$  We now have a clear guiding principle (minimax
regret) which we can use to construct ‘optimal’ two-part codes, which achieve the
minimax regret among all two-part codes. How do such optimal two-part codes compare
to the NML code length? Let $\mathcal{M}$ be a $k$-dimensional model. By slightly adjusting the
arguments of [8], one can show that, under regularity conditions, the minimax optimal
two-part code $\bar{P}_{2\text{-}p}$ achieves regret

$$\mathcal{R} = -\log \bar{P}_{2\text{-}p}(x^n \mid \mathcal{M}) + \log P(x^n \mid \hat\theta(x^n)) = \frac{k}{2} \log \frac{n}{2\pi} + \log \int_{\theta \in \Theta} \sqrt{|I(\theta)|}\, d\theta + f(k) + o(1),$$

where $f : \mathbb{N} \to \mathbb{R}$ is a bounded positive function satisfying $\lim_{k \to \infty} f(k) = 0$. Thus,
for large $k$, optimally designed two-part codes are about as good as NML. The problem
with two-part code MDL is that in practice, people often use much cruder codes with
much larger minimax regret.


4.5.3 Bayesian interpretation

The Bayesian method of statistical inference provides several alternative approaches
to model selection. The most popular of these is based on Bayes factors [63]. The
Bayes factor method is very closely related to the refined MDL approach. Assuming
uniform priors on models $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$, it tells us to select the model with largest
marginal likelihood $\bar{P}_{\mathrm{Bayes}}(x^n \mid \mathcal{M}^{(j)})$. $\bar{P}_{\mathrm{Bayes}}$ is as in (4.12), with the sum replaced
by an integral:

$$\bar{P}_{\mathrm{Bayes}}(x^n \mid \mathcal{M}^{(j)}) = \int P(x^n \mid \theta)\, w^{(j)}(\theta)\, d\theta, \tag{4.22}$$

where $w^{(j)}$ is the density of the prior distribution on $\mathcal{M}^{(j)}$.

$\mathcal{M}$ is an exponential family  Let now $\bar{P}_{\mathrm{Bayes}} = \bar{P}_{\mathrm{Bayes}}(\cdot \mid \mathcal{M})$ for some fixed model
$\mathcal{M}$. Under regularity conditions on $\mathcal{M}$, we can perform a Laplace approximation of
the integral in (4.22). For the special case that $\mathcal{M}$ is an exponential family, we obtain
the following expression for the regret [59], [100], [63], [4]:

$$\mathcal{R} = -\log \bar{P}_{\mathrm{Bayes}}(x^n) - \left( -\log P(x^n \mid \hat\theta(x^n)) \right) = \frac{k}{2} \log \frac{n}{2\pi} - \log w(\hat\theta) + \log \sqrt{|I(\hat\theta)|} + o(1). \tag{4.23}$$

Let us compare this with (4.21). Under the regularity conditions needed for (4.21),
the quantity on the right-hand side of (4.23) is within $O(1)$ of $\mathrm{COMP}_n(\mathcal{M})$. Thus,
the code length achieved with $\bar{P}_{\mathrm{Bayes}}$ is within a constant of the minimax optimum
$-\log \bar{P}_{\mathrm{nml}}(x^n)$. Since $-\log P(x^n \mid \hat\theta(x^n))$ increases linearly in $n$, this means that if
we compare two models $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$, then for large enough $n$, Bayes and refined
MDL select the same model. If we equip the Bayesian universal model with a special
prior known as the Jeffreys-Bernardo prior [59], [10],

$$w_{\mathrm{Jeffreys}}(\theta) = \frac{\sqrt{|I(\theta)|}}{\int_{\theta \in \Theta} \sqrt{|I(\theta)|}\, d\theta}, \tag{4.24}$$

then Bayes and refined NML become even more closely related: plugging (4.24) into
(4.23), we find that the right-hand side of (4.23) now simply coincides with (4.21). A
concrete example of Jeffreys’ prior is given in example 4.19. Jeffreys introduced his
prior as a ‘least informative prior’, to be used when no useful prior knowledge about
the parameters is available [59]. As one may expect from such a prior, it is
invariant under continuous 1-to-1 reparameterizations of the parameter space. The present
analysis shows that, when $\mathcal{M}$ is an exponential family, it also leads to
asymptotically minimax code length regret: for large $n$, refined NML model selection becomes
indistinguishable from Bayes factor model selection with Jeffreys’ prior.
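
For the Bernoulli model the relation between Bayes with Jeffreys’ prior and NML can be checked numerically. The sketch below relies on the standard fact that Jeffreys’ prior for the Bernoulli model is the Beta(1/2, 1/2) density, so that the marginal likelihood is a ratio of Beta functions; the function names and the sample sizes are illustrative assumptions.

```python
import math

def log2_beta(a, b):
    """log2 of the Beta function B(a, b), via log-gamma for numerical stability."""
    return (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)) / math.log(2)

def bayes_jeffreys_bits(n1, n0):
    """-log2 P_Bayes(x^n) for n1 ones and n0 zeros under the Beta(1/2, 1/2) prior."""
    return log2_beta(0.5, 0.5) - log2_beta(n1 + 0.5, n0 + 0.5)

def ml_bits(n1, n0):
    """-log2 P(x^n | theta_hat(x^n)) for the Bernoulli model."""
    n = n1 + n0
    return -sum(c * math.log2(c / n) for c in (n1, n0) if c > 0)

def bayes_regret_bits(n1, n0):
    """Regret of the Bayes-Jeffreys code relative to the Bernoulli model."""
    return bayes_jeffreys_bits(n1, n0) - ml_bits(n1, n0)

def comp_bits(n):
    """Exact COMP_n of the Bernoulli model in bits (the regret of NML)."""
    total = sum(math.comb(n, k) * (k / n) ** k * (1 - k / n) ** (n - k)
                for k in range(n + 1))
    return math.log2(total)

n = 500
print(f"COMP_n            = {comp_bits(n):.4f} bits")
print(f"regret, 150 ones  = {bayes_regret_bits(150, 350):.4f} bits")
print(f"regret, all zeros = {bayes_regret_bits(0, 500):.4f} bits")
```

For a sequence whose ML estimate lies well inside the unit interval, the Bayes regret is close to COMP_n. For the all-zeros sequence the ML estimate lies on the boundary of the parameter space and the regret is noticeably larger, which illustrates why the expansions require the ML estimate to stay away from the boundary.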

$\mathcal{M}$ is not an exponential family  Under weak conditions on $\mathcal{M}$, $\Theta$ and the sequence
$x^n$, we get the following generalization of (4.23):

$$-\log \bar{P}_{\mathrm{Bayes}}(x^n \mid \mathcal{M}) = -\log P(x^n \mid \hat\theta(x^n)) + \frac{k}{2} \log \frac{n}{2\pi} - \log w(\hat\theta) + \log \sqrt{|\hat{I}(x^n)|} + o(1). \tag{4.25}$$

Here $\hat{I}(x^n)$ is the so-called observed information, sometimes also called observed
Fisher information; see [63] for a definition. If $\mathcal{M}$ is an exponential family, then the
observed Fisher information at $x^n$ coincides with the Fisher information at $\hat\theta(x^n)$, leading
to (4.23). If $\mathcal{M}$ is not exponential, then provided that the data are distributed
according to one of the distributions in $\mathcal{M}$, the observed Fisher information still converges
with probability 1 to the expected Fisher information. If $\mathcal{M}$ is not exponential
and the data are not generated by a distribution in $\mathcal{M}$ either, then there may be $O(1)$-
discrepancies between $-\log \bar{P}_{\mathrm{nml}}$ and $-\log \bar{P}_{\mathrm{Bayes}}$ even for large $n$.


4.5.4  Prequential  interpretation 

Distributions as prediction strategies  Let $P$ be a distribution on $\mathcal{X}^n$. Applying the
definition of conditional probability, we can write for every $x^n$:

$$P(x^n) = \prod_{i=1}^{n} \frac{P(x^i)}{P(x^{i-1})} = \prod_{i=1}^{n} P(x_i \mid x^{i-1}), \tag{4.26}$$

so that also

$$-\log P(x^n) = \sum_{i=1}^{n} -\log P(x_i \mid x^{i-1}). \tag{4.27}$$

Let us abbreviate $P(X_i = \cdot \mid X^{i-1} = x^{i-1})$ to $P(X_i \mid x^{i-1})$. Note that $P(X_i \mid x^{i-1})$
(capital $X_i$) is the distribution (not a single number) of $X_i$ given $x^{i-1}$; $P(x_i \mid x^{i-1})$
(lower case $x_i$) is the probability (a single number) of actual outcome $x_i$ given $x^{i-1}$.
We can think of $-\log P(x_i \mid x^{i-1})$ as the loss incurred when predicting $X_i$ based on the
conditional distribution $P(X_i \mid x^{i-1})$, and the actual outcome turned out to be $x_i$. Here
‘loss’ is measured using the so-called logarithmic score, also known simply as ‘log
loss’. Note that the more likely $x$ is judged to be, the smaller the loss incurred when
$x$ actually is obtained. The log loss has a natural interpretation in terms of sequential
gambling [20], but its main interpretation is still in terms of coding: by (4.27), the code
length needed to encode $x^n$ based on distribution $P$ is just the accumulated log loss
incurred when $P$ is used to sequentially predict the $i$-th outcome based on the past
$i - 1$ outcomes.

Equation (4.26) gives a fundamental re-interpretation of probability distributions as
prediction strategies, mapping each individual sequence of past observations $x_1, \ldots, x_{i-1}$
to a probabilistic prediction of the next outcome $P(X_i \mid x^{i-1})$. Conversely, (4.26) also
shows that every probabilistic prediction strategy for sequential prediction of $n$
outcomes may be thought of as a probability distribution on $\mathcal{X}^n$: a strategy is
identified with a function mapping all potential initial segments $x^{i-1}$ to the prediction
that is made for the next outcome $X_i$, after having seen $x^{i-1}$. Thus, it is a function
$S : \bigcup_{0 \le i < n} \mathcal{X}^i \to \mathcal{P}_{\mathcal{X}}$, where $\mathcal{P}_{\mathcal{X}}$ is the set of distributions on $\mathcal{X}$. We can now
define, for each $i < n$, all $x^i \in \mathcal{X}^i$, $P(X_{i+1} \mid x^i) := S(x^i)$. We can turn these partial
distributions into a distribution on $\mathcal{X}^n$ by sequentially plugging them into (4.26).
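
The correspondence between prediction strategies and distributions on sequences can be made concrete for binary outcomes. In the sketch below, `strategy` is a hypothetical prediction strategy chosen only for illustration (any function from past outcomes to a probability strictly between 0 and 1 would do); the code builds the induced distribution via (4.26) and the accumulated log loss via (4.27).

```python
import math
from itertools import product

def strategy(past):
    """Hypothetical strategy S: maps past outcomes to P(next outcome = 1)."""
    return (sum(past) + 1) / (len(past) + 2)

def sequence_probability(xs, s=strategy):
    """P(x^n) as the product of the conditional predictions, as in (4.26)."""
    p = 1.0
    for i, x in enumerate(xs):
        p1 = s(xs[:i])  # prediction for x_i is based on x_1..x_{i-1} only
        p *= p1 if x == 1 else 1 - p1
    return p

def accumulated_log_loss(xs, s=strategy):
    """Sum of the per-outcome log losses in bits, as in (4.27)."""
    total = 0.0
    for i, x in enumerate(xs):
        p1 = s(xs[:i])
        total += -math.log2(p1 if x == 1 else 1 - p1)
    return total

xs = (0, 1, 1, 0, 1)
print(f"-log2 P(x^n)         = {-math.log2(sequence_probability(xs)):.4f} bits")
print(f"accumulated log loss = {accumulated_log_loss(xs):.4f} bits")
print(f"total probability    = "
      f"{sum(sequence_probability(ys) for ys in product((0, 1), repeat=5)):.4f}")
```

The two code lengths agree, and the probabilities of all sequences of a given length sum to 1, so the strategy does define a proper distribution on sequences.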

Log loss for universal models  Let $\mathcal{M}$ be some parametric model and let $\bar{P}$ be some
universal model/code relative to $\mathcal{M}$. What do the individual predictions $\bar{P}(X_i \mid x^{i-1})$
look like? Readers familiar with Bayesian statistics will realize that for i.i.d.
models, the Bayesian predictive distribution $\bar{P}_{\mathrm{Bayes}}(X_i \mid x^{i-1})$ converges to the ML
distribution $P(\cdot \mid \hat\theta(x^{i-1}))$; example 4.17 provides a concrete case. It seems reasonable to
assume that something similar holds not just for $\bar{P}_{\mathrm{Bayes}}$ but for universal models in
general. This in turn suggests that we may approximate the conditional distributions
$\bar{P}(X_i \mid x^{i-1})$ of any ‘good’ universal model by the maximum likelihood predictions
$P(\cdot \mid \hat\theta(x^{i-1}))$. Indeed, we can recursively define the ‘maximum likelihood plug-in’
distribution $\bar{P}_{\text{plug-in}}$ by setting, for $i = 1$ to $n$,

$$\bar{P}_{\text{plug-in}}(X_i = \cdot \mid x^{i-1}) := P(X = \cdot \mid \hat\theta(x^{i-1})). \tag{4.28}$$

Then, we define

$$-\log \bar{P}_{\text{plug-in}}(x^n) := \sum_{i=1}^{n} -\log P(x_i \mid \hat\theta(x^{i-1})). \tag{4.29}$$

Indeed, it turns out that under regularity conditions on $\mathcal{M}$ and $x^n$,

$$-\log \bar{P}_{\text{plug-in}}(x^n) = -\log P(x^n \mid \hat\theta(x^n)) + \frac{k}{2} \log n + O(1). \tag{4.30}$$

This shows that $\bar{P}_{\text{plug-in}}$ acts as a universal model relative to $\mathcal{M}$, its performance being
within a constant of the minimax optimal $\bar{P}_{\mathrm{nml}}$. The construction of $\bar{P}_{\text{plug-in}}$ can be
easily extended to non-i.i.d. models, and then, under regularity conditions, (4.30) still
holds; we omit the details.

We note that all general proofs of (4.30) that we are aware of show
that (4.30) holds with probability 1 or in expectation for sequences
generated by some distribution in $\mathcal{M}$ [90], [91], [93]. Note that the
expressions (4.21) and (4.25) for the regret of $\bar{P}_{\mathrm{nml}}$ and $\bar{P}_{\mathrm{Bayes}}$ hold
for a much wider class of sequences; they also hold with probability
1 for i.i.d. sequences generated by sufficiently regular distributions
outside $\mathcal{M}$. Not much is known about the regret obtained by $\bar{P}_{\text{plug-in}}$
for such sequences, except for some special cases such as $\mathcal{M}$ being
the Gaussian model.

In general, there is no need to use the ML estimator $\hat\theta(x^{i-1})$ in the definition (4.28).
Instead, we may try some other estimator which asymptotically converges to the ML
estimator - it turns out that some estimators considerably outperform the ML
estimator in the sense that (4.29) becomes a much better approximation of $-\log \bar{P}_{\mathrm{nml}}$, see
example 4.17. Irrespective of whether we use the ML estimator or something else, we
call model selection based on (4.29) the prequential form of MDL in honor of A.P.
Dawid’s ‘prequential analysis’, section 4.8. It is also known as ‘predictive MDL’. The
validity of (4.30) was discovered independently by Rissanen [90] and Dawid [22].

The prequential view gives us a fourth interpretation of refined MDL model selection:
given models $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$, MDL tells us to pick the model that minimizes the
accumulated prediction error resulting from sequentially predicting future outcomes
given all the past outcomes.

Example 4.16 (GLRT and prequential model selection)

How does this differ from the naive version of the generalized likelihood ratio test
(GLRT) that we introduced in example 4.13? In GLRT, we associate with each model
the log-likelihood (minus log loss) that can be obtained by the ML estimator. This is
the predictor within the model that minimizes log loss with hindsight, after having seen
the data. In contrast, prequential model selection associates with each model the
log-likelihood (minus log loss) that can be obtained by using a sequence of ML estimators
$\hat\theta(x^{i-1})$ to predict data $x_i$. Crucially, the data on which the ML estimators are evaluated
have not been used in constructing the ML estimators themselves. This makes the
prediction scheme ‘honest’ (different data are used for training and testing) and explains
why it automatically protects us against overfitting. ◊


Example  4.17  (Laplace  and  Jeffreys) 

Consider the prequential distribution for the Bernoulli model, example 4.7, defined as in (4.28). We show that if we take $\hat\theta$ in (4.28) equal to the ML estimator $n_{[1]}/n$ (with $n_{[1]}$ the number of 1s in the sample), then the resulting $P_{\text{plug-in}}$ is not a universal model; but a slight modification of the ML estimator makes $P_{\text{plug-in}}$ a very good universal model. Suppose that $n \ge 3$ and $(x_1, x_2, x_3) = (0, 0, 1)$, a not-so-unlikely initial segment according to most $\theta$. Then $P_{\text{plug-in}}(X_3 = 1 \mid x_1, x_2) = P(X = 1 \mid \hat\theta(x_1, x_2)) = 0$, so that by (4.29), we get

$$-\log P_{\text{plug-in}}(x^n) \ge -\log P_{\text{plug-in}}(x_3 \mid x_1, x_2) = \infty,$$



hence $P_{\text{plug-in}}$ is not universal. Now let us consider the modified ML estimator

$$\hat\theta_\lambda(x^n) := \frac{n_{[1]} + \lambda}{n + 2\lambda}. \tag{4.31}$$


If we take $\lambda = 0$, we get the ordinary ML estimator. If we take $\lambda = 1$, then an exercise involving beta-integrals shows that, for all $i$ and $x^i$, $P(X_i \mid \hat\theta_1(x^{i-1})) = P_{\text{Bayes}}(X_i \mid x^{i-1})$, where $P_{\text{Bayes}}$ is defined relative to the uniform prior $w(\theta) \equiv 1$. Thus $\hat\theta_1(x^{i-1})$ corresponds to the Bayesian predictive distribution for the uniform prior. This prediction rule was advocated by the great probabilist P.S. de Laplace, co-originator of Bayesian statistics. It may be interpreted as ML estimation based on an extended sample, containing some ‘virtual’ data: an extra 0 and an extra 1.

Even better, a similar calculation shows that if we take $\lambda = 1/2$, the resulting estimator is equal to $P_{\text{Bayes}}(X_i \mid x^{i-1})$ defined relative to Jeffreys’ prior. Asymptotically, $P_{\text{Bayes}}$ with Jeffreys’ prior achieves the same code lengths as $P_{\text{nml}}$ (section 4.5.3). It follows that $P_{\text{plug-in}}$ with the slightly modified ML estimator is asymptotically indistinguishable from the optimal universal model $P_{\text{nml}}$!

For more general models $\mathcal{M}$, such simple modifications of the ML estimator usually do not correspond to a Bayesian predictive distribution; for example, if $\mathcal{M}$ is not convex (closed under taking mixtures) then a point estimator (an element of $\mathcal{M}$) typically does not correspond to the Bayesian predictive distribution (a mixture of elements of $\mathcal{M}$). Nevertheless, modifying the ML estimator by adding some virtual data $y_1, \ldots, y_m$ and replacing $P(X_i \mid \hat\theta(x^{i-1}))$ by $P(X_i \mid \hat\theta(x^{i-1}, y^m))$ in the definition (4.28) may still lead to good universal models. This is of great practical importance, since, using (4.29), $-\log P_{\text{plug-in}}(x^n)$ is often much easier to compute than $-\log P_{\text{Bayes}}(x^n)$. □
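The effect of the virtual data in (4.31) can be checked numerically. The following is a minimal sketch, not part of the original report: the function names are illustrative, code lengths are in bits, and the prediction 1/2 for an empty past with $\lambda = 0$ (where the ML estimator is undefined) is an assumption made only to get the recursion started.

```python
import math

def plug_in_code_length(xs, lam):
    """-log2 P_plug-in(xs) for the Bernoulli model, predicting each outcome
    with theta = (n1 + lam) / (n + 2*lam) computed from past outcomes only."""
    n1 = 0        # number of ones among the past outcomes
    bits = 0.0
    for n, x in enumerate(xs):          # n = number of past outcomes
        denom = n + 2 * lam
        # Assumption: with an empty past and lam = 0, predict 1/2.
        theta = (n1 + lam) / denom if denom > 0 else 0.5
        p = theta if x == 1 else 1.0 - theta
        if p == 0.0:
            return math.inf             # a zero-probability outcome costs infinitely many bits
        bits -= math.log2(p)
        n1 += x
    return bits

def bayes_uniform_code_length(xs):
    """-log2 of the Bayesian marginal under the uniform prior,
    i.e. the beta integral n1! (n - n1)! / (n + 1)!."""
    n, n1 = len(xs), sum(xs)
    p = math.factorial(n1) * math.factorial(n - n1) / math.factorial(n + 1)
    return -math.log2(p)

xs = [0, 0, 1]
ml_bits = plug_in_code_length(xs, 0.0)        # plain ML estimator
laplace_bits = plug_in_code_length(xs, 1.0)   # lambda = 1
jeffreys_bits = plug_in_code_length(xs, 0.5)  # lambda = 1/2
print(ml_bits, laplace_bits, jeffreys_bits, bayes_uniform_code_length(xs))
```

On the sequence (0, 0, 1) the plain ML plug-in code length is infinite, while the $\lambda = 1$ code length coincides with the Bayesian code length under the uniform prior, as the text states.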


Summary. We introduced the refined MDL principle for model selection in a restricted setting. Refined MDL amounts to selecting the model under which the data achieve the smallest stochastic complexity, which is the code length according to the minimax optimal universal model. We gave an asymptotic expansion of stochastic and parametric complexity, and interpreted these concepts in four different ways.


4.6  General  refined  MDL:  gluing  it  all  together 

In the previous section we introduced a ‘refined’ MDL principle based on minimax regret. Unfortunately, this principle can be applied only in very restricted settings. We now show how to extend refined MDL, leading to a general MDL principle, applicable to a wide variety of model selection problems. In doing so we glue all our previous insights (including ‘crude MDL’) together, thereby uncovering a single general, underlying principle, formulated in observation 4.4. Therefore, if one understands the material in this section, then one understands the minimum description length principle.

First, in section 4.6.1, we show how to compare infinitely many models. Then, section 4.6.2 shows how to proceed for models $\mathcal{M}$ for which the parametric complexity is undefined. Remarkably, a single, general idea resides behind our solution of both problems, and this leads us to formulate, in section 4.6.3, a single, general refined MDL principle.

4.6.1  Model  selection  with  infinitely  many  models

Suppose we want to compare more than two models for the same data. If the number to be compared is finite, we can proceed as before and pick the model $\mathcal{M}^{(k)}$ with smallest $-\log P_{\text{nml}}(x^n \mid \mathcal{M}^{(k)})$. If the number of models is infinite, we have to be more careful. Say we compare models $\mathcal{M}^{(1)}, \mathcal{M}^{(2)}, \ldots$ for data $x^n$. We may be tempted to pick the model minimizing $-\log P_{\text{nml}}(x^n \mid \mathcal{M}^{(k)})$ over all $k \in \{1, 2, \ldots\}$, but in some cases this gives unintended results. To illustrate, consider the extreme case that every $\mathcal{M}^{(k)}$ contains just one distribution. For example, let $\mathcal{M}^{(1)} = \{P_1\}, \mathcal{M}^{(2)} = \{P_2\}, \ldots$, where $\{P_1, P_2, \ldots\}$ is the set of all Markov chains with rational-valued parameters. In that case, $\text{COMP}_n(\mathcal{M}^{(k)}) = 0$ for all $k$, and we would always select the maximum likelihood Markov chain that assigns probability 1 to data $x^n$. Typically this will be a chain of very high order, severely overfitting the data. This cannot be right! A better idea is to pick the model minimizing

$$-\log P_{\text{nml}}(x^n \mid \mathcal{M}^{(k)}) + L(k), \tag{4.32}$$

where $L$ is the code length function of some code for encoding model indices $k$. We would typically choose the standard prior for the integers, $L(k) = 2 \log k + 1$ (example 4.4). By using (4.32) we avoid the overfitting problem mentioned above: if $\mathcal{M}^{(1)} = \{P_1\}, \mathcal{M}^{(2)} = \{P_2\}, \ldots$, where $P_1, P_2, \ldots$ is a list of all the rational-parameter Markov chains, (4.32) would reduce to two-part code MDL (section 4.3), which is asymptotically consistent. On the other hand, if $\mathcal{M}^{(k)}$ represents the set of $k$-th order Markov chains, the term $L(k)$ is typically negligible compared to $\text{COMP}_n(\mathcal{M}^{(k)})$, the complexity term associated with $\mathcal{M}^{(k)}$ that is hidden in $-\log P_{\text{nml}}(x^n \mid \mathcal{M}^{(k)})$. Thus, the complexity of $\mathcal{M}^{(k)}$ comes from the fact that for large $k$, $\mathcal{M}^{(k)}$ contains many distinguishable distributions; not from the much smaller term $L(k) \approx 2 \log k$.

To make our previous approach for a finite set of models compatible with (4.32), we can reinterpret it as follows: we assign uniform code lengths (a uniform prior) to the models $\mathcal{M}^{(1)}, \ldots, \mathcal{M}^{(M)}$ under consideration, so that for $k = 1, \ldots, M$, $L(k) = \log M$. We then pick the model minimizing (4.32). Since $L(k)$ is constant over $k$, it plays no role in the minimization and can be dropped from the equation, so that our procedure reduces to our original refined MDL model selection method. We shall henceforth assume that we always encode the model index, either implicitly (if the number of models is finite) or explicitly. The general principle behind this is explained in section 4.6.3.
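As an illustration of (4.32), the sketch below adds the integer code length $L(k) = 2 \log k + 1$ to a list of per-model NML code lengths and picks the minimizer. The code lengths in `lengths` are made-up numbers chosen only to show the penalty changing the outcome, and the function names are hypothetical. The partial Kraft sum confirms that $L$ is a valid code length function, since $\sum_k 2^{-L(k)} = \frac{1}{2}\sum_k k^{-2} = \pi^2/12 < 1$.

```python
import math

def integer_code_length(k):
    """Standard code length for the positive integers, in bits: L(k) = 2 log2 k + 1."""
    return 2 * math.log2(k) + 1

def select_model(nml_code_lengths):
    """Return (k, total) minimizing -log2 P_nml(x^n | M^(k)) + L(k), with k 1-based.
    nml_code_lengths[k-1] is assumed to hold -log2 P_nml(x^n | M^(k))."""
    totals = [c + integer_code_length(k)
              for k, c in enumerate(nml_code_lengths, start=1)]
    best = min(range(len(totals)), key=totals.__getitem__)
    return best + 1, totals[best]

# Kraft check on a long partial sum: the limit is pi^2 / 12, which is below 1.
kraft_sum = sum(2 ** -integer_code_length(k) for k in range(1, 100_000))

# Hypothetical NML code lengths (bits) for models k = 1..4.
lengths = [120.0, 101.0, 99.5, 99.0]
best_k, best_total = select_model(lengths)
print(best_k, round(best_total, 3), round(kraft_sum, 4))
```

Without the term $L(k)$ the fourth model would win on these numbers; with it, the third model is selected.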


4.6.2  The  infinity  problem 

For some of the most commonly used models, the parametric complexity $\text{COMP}(\mathcal{M})$ is undefined. A prime example is the Gaussian location model, which we discuss below. As we will see, we can ‘repair’ the situation using the same general idea as in the previous subsection.


Example  4.18  (Parametric  complexity  of  the  normal  distributions) 

Let $\mathcal{M}$ be the family of normal distributions with fixed variance $\sigma^2$ and varying mean $\mu$, identified by their densities

$$P(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),$$

extended to sequences $x_1, \ldots, x_n$ by taking product densities. As is well-known [15], the ML estimator $\hat\mu(x^n)$ is equal to the sample mean: $\hat\mu(x^n) = n^{-1} \sum_{i=1}^{n} x_i$. An easy calculation shows that

$$\text{COMP}_n(\mathcal{M}) = \log \int_{x^n} P(x^n \mid \hat\mu(x^n))\,dx^n = \infty,$$

where we abbreviated $dx_1 \cdots dx_n$ to $dx^n$. Therefore, we cannot use basic MDL model selection. It also turns out that $I(\mu) = \sigma^{-2}$, so that

$$\int_{\Theta} \sqrt{|I(\theta)|}\,d\theta = \int_{\mu \in \mathbb{R}} \sqrt{|I(\mu)|}\,d\mu = \infty.$$


Thus, the Bayesian universal model approach with Jeffreys’ prior cannot be applied either. Does this mean that our MDL model selection and complexity definitions break down even in such a simple case? Luckily, it turns out that they can be repaired, as we now show. In [9] and [33] it is shown that, for all intervals $[a, b]$,

$$\int_{x^n :\, \hat\mu(x^n) \in [a, b]} P(x^n \mid \hat\mu(x^n))\,dx^n = \frac{b - a}{\sqrt{2\pi}\,\sigma} \cdot \sqrt{n}. \tag{4.33}$$


Suppose for the moment that it is known that $\hat\mu$ lies in some interval $[-K, K]$ for some fixed $K$. Let $\mathcal{M}_K$ be the set of conditional distributions obtained as follows: $\mathcal{M}_K = \{P'(\cdot \mid \mu) \mid \mu \in \mathbb{R}\}$, where $P'(x^n \mid \mu)$ is the density of $x^n$ according to the normal distribution with mean $\mu$, conditioned on $|n^{-1} \sum x_i| \le K$. By (4.33), the ‘conditional’ minimax regret distribution $P_{\text{nml}}(\cdot \mid \mathcal{M}_K)$ is well-defined for all $K > 0$. That is, for all $x^n$ with $|\hat\mu(x^n)| \le K$, we obtain

$$P_{\text{nml}}(x^n \mid \mathcal{M}_K) = \frac{P'(x^n \mid \hat\mu(x^n))}{\int_{|\hat\mu(x^n)| < K} P'(x^n \mid \hat\mu(x^n))\,dx^n},$$

with regret (or in this case ‘conditional’ complexity),

$$\text{COMP}_n(\mathcal{M}_K) = \log \int_{|\hat\mu(x^n)| < K} P(x^n \mid \hat\mu(x^n))\,dx^n = \log K + \frac{1}{2} \log \frac{n}{2\pi} - \log \sigma + 1.$$
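The closed form for the conditional complexity follows from (4.33) in a few lines. As a worked step, take $[a, b] = [-K, K]$ in (4.33) and use base-2 logarithms, so that $\log 2 = 1$ (the base is an assumption read off from the trailing $+1$ in the formula):

```latex
\begin{align*}
\log \int_{|\hat\mu(x^n)| < K} P(x^n \mid \hat\mu(x^n))\,dx^n
  &= \log \frac{2K \sqrt{n}}{\sqrt{2\pi}\,\sigma} \\
  &= \log 2 + \log K + \tfrac{1}{2} \log n - \tfrac{1}{2} \log 2\pi - \log \sigma \\
  &= \log K + \tfrac{1}{2} \log \frac{n}{2\pi} - \log \sigma + 1 .
\end{align*}
```

The complexity thus grows logarithmically in the bound $K$, which is why it diverges when no bound is imposed.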

This suggests redefining the complexity of the full model $\mathcal{M}$ so that its regret depends on the area in which $\hat\mu$ falls. The most straightforward way of achieving this is to define a meta-universal model for $\mathcal{M}$, combining the NML with a two-part code: we encode data by first encoding some value for $K$. We then encode the actual data $x^n$ using the code $P_{\text{nml}}(\cdot \mid \mathcal{M}_K)$. The resulting code $P_{\text{meta}}$ is a universal code for $\mathcal{M}$ with lengths

$$-\log P_{\text{meta}}(x^n \mid \mathcal{M}) := \min_K \{ -\log P_{\text{nml}}(x^n \mid \mathcal{M}_K) + L(K) \}. \tag{4.34}$$

The idea is now to base MDL model selection on $P_{\text{meta}}(\cdot \mid \mathcal{M})$ as in (4.34) rather than on the (undefined) $P_{\text{nml}}(\cdot \mid \mathcal{M})$. To make this work, we need to choose $L$ in a clever manner. A good choice is to encode $K' = \log K$ as an integer, using the standard code for the integers. To see why, note that the regret of $P_{\text{meta}}$ now becomes:


$$\begin{aligned}
-\log P_{\text{meta}}(x^n \mid \mathcal{M}) - [-\log P(x^n \mid \hat\mu(x^n))]
&= \min_{K :\, K \ge |\hat\mu(x^n)|} \left\{ \log K + \frac{1}{2} \log \frac{n}{2\pi} - \log \sigma + 1 + 2 \log \lceil \log K \rceil \right\} + 1 \\
&\le \log |\hat\mu(x^n)| + 2 \log \log |\hat\mu(x^n)| + \frac{1}{2} \log \frac{n}{2\pi} - \log \sigma + 4 \\
&\le \text{COMP}_n(\mathcal{M}_{|\hat\mu|}) + 2 \log \text{COMP}_n(\mathcal{M}_{|\hat\mu|}) + 3.
\end{aligned} \tag{4.35}$$

If we had known a good bound $K$ on $|\hat\mu|$ a priori, we could have used the NML model $P_{\text{nml}}(\cdot \mid \mathcal{M}_K)$. With ‘maximal’ a priori knowledge, we would have used the model $P_{\text{nml}}(\cdot \mid \mathcal{M}_{|\hat\mu|})$, leading to regret $\text{COMP}_n(\mathcal{M}_{|\hat\mu|})$. The regret achieved by $P_{\text{meta}}$ is almost as good as this ‘smallest possible regret-with-hindsight’ $\text{COMP}_n(\mathcal{M}_{|\hat\mu|})$: the difference is much smaller than, in fact logarithmic in, $\text{COMP}_n(\mathcal{M}_{|\hat\mu|})$ itself, no matter what $x^n$ we observe. This is the underlying reason why we choose to encode $K$ with log-precision: the basic idea in refined MDL was to minimize the worst-case regret, or additional code length compared to the code that achieves the minimal code length with hindsight. Here, we use this basic idea on a meta-level: we design a code such that the additional regret is minimized, compared to the code that achieves the minimal regret with hindsight. □
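To get a feeling for the size of the overhead in (4.35), the following sketch evaluates the regret of the meta two-part code for one concrete setting. It is an illustration under stated assumptions rather than the report's own procedure: $K$ is restricted to powers of two $2^j$ with integer $j \ge 1$ (so that the standard integer code applies to $j$), logarithms are base 2, and the function names are hypothetical.

```python
import math

def conditional_complexity(K, n, sigma):
    """COMP_n(M_K) in bits: log K + (1/2) log(n / (2 pi)) - log sigma + 1."""
    return (math.log2(K) + 0.5 * math.log2(n / (2 * math.pi))
            - math.log2(sigma) + 1)

def meta_regret(mu_hat, n, sigma):
    """Regret of the meta two-part code. Both terms grow with j, so the best
    choice is the smallest K = 2**j (integer j >= 1, an assumption) covering
    |mu_hat|; j itself costs L(j) = 2 log2 j + 1 bits."""
    m = abs(mu_hat)
    j = 1 if m <= 2 else math.ceil(math.log2(m))
    return conditional_complexity(2 ** j, n, sigma) + 2 * math.log2(j) + 1

mu_hat, n, sigma = 100.0, 100, 1.0
regret = meta_regret(mu_hat, n, sigma)
hindsight = conditional_complexity(abs(mu_hat), n, sigma)  # COMP_n(M_|mu_hat|)
overhead = regret - hindsight
print(round(regret, 3), round(hindsight, 3), round(overhead, 3))
```

For these values the overhead is a handful of bits, in line with the claim that it is logarithmic in the hindsight regret.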


This meta-two-part coding idea was introduced by Rissanen [95]. It can be extended to a wide range of models with $\text{COMP}_n(\mathcal{M}) = \infty$; for example, if the $X_i$ represent outcomes of a Poisson or geometric distribution, one can encode a bound on $\mu$ just like in example 4.18. If $\mathcal{M}$ is the full Gaussian model with both $\mu$ and $\sigma^2$ allowed to vary, one has to encode a bound on $\mu$ and a bound on $\sigma^2$. Essentially the same holds for linear regression problems, section 4.7.

Renormalized maximum likelihood. Meta two-part coding is just one possible solution to the problem of undefined $\text{COMP}_n(\mathcal{M})$. It is suboptimal, the main reason being the use of two-part codes. Indeed, these two-part codes are not complete (section 4.1): they reserve several code words for the same data $D = (x_1, \ldots, x_n)$ (one for each integer value of $\log K$). Therefore, there must exist more efficient (one-part) codes $P'_{\text{meta}}$ such that for all $x^n \in \mathcal{X}^n$, $P'_{\text{meta}}(x^n) > P_{\text{meta}}(x^n)$. In accordance with the idea that we should minimize description length, such alternative codes are preferable. This realization has led to a search for more efficient and intrinsic solutions to the problem. In [33], the possibility of restricting the parameter values rather than the data is considered, and a general framework is developed for comparing universal codes for models with undefined $\text{COMP}(\mathcal{M})$. Rissanen [97] suggests the following elegant solution. He defines the renormalized maximum likelihood (RNML) distribution $P_{\text{rnml}}$. In our Gaussian example, this universal model would be defined as follows. Let $K(x^n)$ be the bound on $\hat\mu(x^n)$ that maximizes $P_{\text{nml}}(x^n \mid \mathcal{M}_K)$ for the actually given $x^n$. That is, $K(x^n) = |\hat\mu(x^n)|$. Then $P_{\text{rnml}}$ is defined, for all $x^n \in \mathcal{X}^n$, as

$$P_{\text{rnml}}(x^n \mid \mathcal{M}) = \frac{P_{\text{nml}}(x^n \mid \mathcal{M}_{K(x^n)})}{\int_{x^n \in \mathbb{R}^n} P_{\text{nml}}(x^n \mid \mathcal{M}_{K(x^n)})\,dx^n}. \tag{4.36}$$


Model  selection  between  a  finite  set  of  models  now  proceeds  by  selecting  the  model 
maximizing  the  re-normalized  likelihood  (4.36). 

Region indifference. All the approaches considered thus far slightly prefer some regions of the parameter space over others. In spite of its elegance, even the Rissanen renormalization is slightly ‘arbitrary’ in this way: had we chosen the origin of the real axis differently, the same sequence $x^n$ would have achieved a different code length $-\log P_{\text{rnml}}(x^n \mid \mathcal{M})$. In recent work, Liang and Barron [73], [74] consider a novel and quite different approach for dealing with infinite $\text{COMP}_n(\mathcal{M})$ that partially addresses this problem. They make use of the fact that, while Jeffreys’ prior is improper ($\int \sqrt{|I(\theta)|}\,d\theta$ is infinite), using Bayes’ rule we can still compute Jeffreys’ posterior based on the first few observations, and this posterior turns out to be a proper probability measure after all. Liang and Barron use universal models of a somewhat different type than $P_{\text{nml}}$, so it remains to be investigated whether their approach can be adapted to the form of MDL discussed here.

4.6.3  The  general  picture 

Section 4.6.1 illustrates that, in all applications of MDL, we first define a single universal model that allows us to code all sequences with length equal to the given sample size. If the set of models is finite, we use the uniform prior. We do this in order to be as ‘honest’ as possible, treating all models under consideration on the same footing. But if the set of models becomes infinite, there exists no uniform prior any more. Therefore, we must choose a non-uniform prior/non-fixed length code to encode the model index. In order to treat all models still ‘as equally as possible’, we should use some code which is ‘close’ to uniform, in the sense that the code length increases only very slowly with $k$. We choose the standard prior for the integers (example 4.4), but we could also have chosen different priors, for example, a prior $P(k)$ which is uniform on $k = 1, \ldots, M$ for some large $M$, and $P(k) \propto k^{-2}$ for $k > M$. Whatever prior we choose, we are forced to encode a slight preference of some models over others; see section 4.9.1.


Observation  4.4  (General  ‘refined’  MDL  principle  for  model  selection) 

Suppose we plan to select between models $\mathcal{M}^{(1)}, \mathcal{M}^{(2)}, \ldots$ for data $D = (x_1, \ldots, x_n)$. MDL tells us to design a universal code $P$ for $\mathcal{X}^n$, in which the index $k$ of $\mathcal{M}^{(k)}$ is encoded explicitly. The resulting code has two parts, the two sub-codes being defined such that:

1. All models $\mathcal{M}^{(k)}$ are treated on the same footing, as far as possible: we assign a uniform prior to these models, or, if that is not possible, a prior ‘close to’ uniform.

2. All distributions within each $\mathcal{M}^{(k)}$ are treated on the same footing, as far as possible: we use the minimax regret universal model $P_{\text{nml}}(x^n \mid \mathcal{M}^{(k)})$. If this model is undefined or too hard to compute, we instead use a different universal model that achieves regret ‘close to’ the minimax regret for each submodel of $\mathcal{M}^{(k)}$ in the sense of (4.35).

In the end, we encode data $D$ using a hybrid two-part/one-part universal model, explicitly encoding the models we want to select between and implicitly encoding any distributions contained in those models.


Section 4.6.2 applies the same idea, but implemented at a meta-level: we try to associate with $\mathcal{M}^{(k)}$ a code for encoding outcomes in $\mathcal{X}^n$ that achieves uniform (= minimax) regret for every sequence $x^n$. If this is not possible, we still try to assign regret as ‘uniformly’ as we can, by carving up the parameter space in regions with larger and larger minimax regret, and devising a universal code that achieves regret not much larger than the minimax regret achievable within the smallest region containing the ML estimator. Again, the codes we used encoded a slight preference of some regions of the parameter space over others, but our aim was to keep this preference as small as possible. The general idea is summarized in observation 4.4, which provides an (informal) definition of MDL, but only in a restricted context. If we go beyond that context, these prescriptions cannot be used literally, but extensions in the same spirit suggest themselves. Here is a first example of such an extension:


Example  4.19  (MDL  and  Local  Maxima  in  the  Likelihood) 

In practice we often work with models for which the ML estimator cannot be calculated efficiently; or at least, no algorithm for efficient calculation of the ML estimator is known. Examples are finite and Gaussian mixtures and hidden Markov models. In such cases one typically resorts to methods such as expectation maximization (EM) or gradient descent, which find a local maximum of the likelihood surface (function) $P(x^n \mid \theta)$, leading to a local maximum likelihood estimator (LML) $\dot\theta(x^n)$. Suppose we need to select between a finite number of such models. We may be tempted to pick the model $\mathcal{M}$ maximizing the normalized likelihood $P_{\text{nml}}(x^n \mid \mathcal{M})$. However, if we then plan to use the local estimator $\dot\theta(x^n)$ for predicting future data, this is not the right thing to do. To see this, note that, if suboptimal estimators $\dot\theta$ are to be used, the ability of model $\mathcal{M}$ to fit arbitrary data patterns may be severely diminished! Rather than using $P_{\text{nml}}$, we should redefine it to take into account the fact that $\dot\theta$ is not the global ML estimator:

$$P_{\text{lnml}}(x^n) := \frac{P(x^n \mid \dot\theta(x^n))}{\sum_{x^n \in \mathcal{X}^n} P(x^n \mid \dot\theta(x^n))},$$

leading to an adjusted parametric complexity

$$\text{COMP}_n^{(\text{lnml})}(\mathcal{M}) := \log \sum_{x^n \in \mathcal{X}^n} P(x^n \mid \dot\theta(x^n)), \tag{4.37}$$

which, for every estimator $\dot\theta$ different from $\hat\theta$, must be strictly smaller than $\text{COMP}_n(\mathcal{M})$.

□
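The inequality below (4.37) can be checked on a small case. The sketch uses the Bernoulli model, where the sum over all $2^n$ sequences collapses to a sum over the number of ones, and compares the ML estimator with a shrunk estimator $(n_{[1]} + 1)/(n + 2)$. That second estimator is only a hypothetical stand-in for a suboptimal (for example, local) estimator; it is not one discussed in the example itself.

```python
import math

def complexity_bits(n, estimator):
    """log2 of the sum over all x^n in {0,1}^n of P(x^n | estimator(n1, n)),
    grouping the sequences by their number of ones n1."""
    total = 0.0
    for n1 in range(n + 1):
        theta = estimator(n1, n)
        # Python evaluates 0.0 ** 0 as 1.0, matching the convention 0^0 = 1.
        total += math.comb(n, n1) * theta ** n1 * (1.0 - theta) ** (n - n1)
    return math.log2(total)

def ml_estimator(n1, n):
    return n1 / n

def shrunk_estimator(n1, n):
    # Hypothetical suboptimal estimator, used only for illustration.
    return (n1 + 1) / (n + 2)

comp_ml = complexity_bits(10, ml_estimator)
comp_adjusted = complexity_bits(10, shrunk_estimator)
print(round(comp_ml, 3), round(comp_adjusted, 3))
```

Because the ML estimator maximizes every term of the sum, any other estimator can only lower the total, so the adjusted complexity comes out smaller.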


Summary. We have shown how to extend refined MDL beyond the restricted settings of section 4.5. This uncovered the general principle behind refined MDL for model selection, given in observation 4.4. General as it may be, it only applies to model selection. In the next section we briefly discuss extensions to other applications.


4.7  Beyond  parametric  model  selection 

The  general  principle  as  given  in  observation  4.4  only  applies  to  model  selection.  It  can 
be  extended  in  several  directions.  These  range  over  many  different  tasks  of  inductive 
inference  -  we  mention  prediction,  transduction  (as  defined  in  [115]),  clustering  [68] 
and  similarity  detection  [72].  In  these  areas  there  has  been  less  research  and  a  ‘definite’ 
MDL  approach  has  not  yet  been  formulated. 

MDL  has  been  developed  in  some  detail  for  some  other  inductive  tasks:  non-parametric 
inference,  parameter  estimation  and  regression  and  classification  problems.  We  give  a 
very  brief  overview  of  these  -  for  details  we  refer  to  [9],  [52]  and,  for  the  classification 
case  [49]. 

Non-parametric inference. Sometimes the model class $\mathcal{M}$ is so large that it cannot be finitely parameterized. For example, let $\mathcal{X} = [0, 1]$ be the unit interval and let $\mathcal{M}$ be the i.i.d. model consisting of all distributions on $\mathcal{X}$ with densities $f$ such that $-\log f(x)$ is a continuous function on $\mathcal{X}$. $\mathcal{M}$ is clearly ‘non-parametric’: it cannot be meaningfully parameterized by a connected finite-dimensional parameter set $\Theta^{(k)} \subseteq \mathbb{R}^k$. We may still try to learn a distribution from $\mathcal{M}$ in various ways, for example by histogram density estimation [94] or kernel density estimation [93]. MDL is quite suitable for such applications, in which we typically select a density $f$ from a class $\mathcal{M}^{(n)} \subset \mathcal{M}$, where $\mathcal{M}^{(n)}$ grows with $n$, and every $P^* \in \mathcal{M}$ can be arbitrarily well approximated by members of $\mathcal{M}^{(n)}, \mathcal{M}^{(n+1)}, \ldots$ in the sense that [9]

$$\lim_{n \to \infty} \inf_{P \in \mathcal{M}^{(n)}} D(P^* \,\|\, P) = 0.$$

Here $D$ is the Kullback-Leibler divergence [20] between $P^*$ and $P$.

MDL parameter estimation: three approaches. The ‘crude’ MDL method (section 4.3) was a means of doing model selection and parameter estimation at the same time. ‘Refined’ MDL only dealt with selection of models. If instead, or at the same time, parameter estimates are needed, they may be obtained in three different ways. Historically the first way [93], [52] was to simply use the refined MDL principle to pick a parametric model $\mathcal{M}^{(k)}$, and then, within $\mathcal{M}^{(k)}$, pick the ML estimator $\hat\theta^{(k)}$. After all, we associate with $\mathcal{M}^{(k)}$ the distribution $P_{\text{nml}}$ with code lengths ‘as close as possible’ to those achieved by the ML estimator. This suggests that within $\mathcal{M}^{(k)}$, we should prefer the ML estimator. But upon closer inspection, observation 4.4 suggests using a two-part code also to select $\theta$ within $\mathcal{M}^{(k)}$; namely, we should discretize the parameter space in such a way that the resulting two-part code achieves the minimax regret among all two-part codes; we then pick the (quantized) $\theta$ minimizing the two-part code length. Essentially this approach has been worked out in detail by Barron and Cover [8]. The resulting estimators may be called two-part code MDL estimators. A third possibility is to define predictive MDL estimators such as the Laplace and Jeffreys estimators of example 4.17; once again, these can be understood as an extension of observation 4.4 [9]. These second and third possibilities are more sophisticated than the first. However, if the model $\mathcal{M}$ is finite-dimensional parametric and $n$ is large, then both the two-part and the predictive MDL estimators will become indistinguishable from the maximum likelihood estimators. For this reason, it has sometimes been claimed that MDL parameter estimation is just ML parameter estimation. Since the estimates can be quite different for small samples, this statement is misleading.

Regression. In regression problems we are interested in learning how the values $y_1, \ldots, y_n$ of a regression variable $Y$ depend on the values $x_1, \ldots, x_n$ of the regressor variable $X$. We assume or hope that there exists some function $h : \mathcal{X} \to \mathcal{Y}$ so that $h(X)$ predicts the value $Y$ reasonably well, and we want to learn such an $h$ from data. To this end, we assume a set of candidate predictors (functions) $\mathcal{H}$. In example 3.2, we took $\mathcal{H}$ to be the set of all polynomials. In the standard formulation of this problem, we take $h$ to express that

$$Y_i = h(X_i) + Z_i, \tag{4.38}$$

where the $Z_i$ are i.i.d. Gaussian random variables with mean 0 and some variance $\sigma^2$, independent of $X_i$. That is, we assume Gaussian noise: equation (4.38) implies that the conditional density of $y_1, \ldots, y_n$, given $x_1, \ldots, x_n$, is equal to the product of $n$ Gaussian densities

$$P(y^n \mid x^n, \sigma, h) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} \exp\left(-\frac{\sum_{i=1}^{n} (y_i - h(x_i))^2}{2\sigma^2}\right). \tag{4.39}$$

With this choice, the log-likelihood becomes a linear function of the squared error:

$$-\ln P(y^n \mid x^n, \sigma, h) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - h(x_i))^2 + \frac{n}{2} \ln 2\pi\sigma^2. \tag{4.40}$$

Note that $\ln(\cdot) = \ln 2 \cdot \log(\cdot)$. Let us now assume that $\mathcal{H} = \bigcup_{k \ge 1} \mathcal{H}^{(k)}$, where for each $k$, $\mathcal{H}^{(k)}$ is a set of functions $h : \mathcal{X} \to \mathcal{Y}$. For example, $\mathcal{H}^{(k)}$ may be the set of $k$-th degree polynomials.
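The identity (4.40) is easy to verify numerically. The sketch below computes the negative log-likelihood once from the squared-error form (4.40) and once by summing the logs of the individual Gaussian densities in (4.39). The data points and the candidate predictor `h` are made-up values for illustration, and the function names are hypothetical.

```python
import math

def neg_log_likelihood(ys, xs, sigma, h):
    """-ln P(y^n | x^n, sigma, h) via (4.40):
    squared error / (2 sigma^2) + (n / 2) ln(2 pi sigma^2)."""
    sq_err = sum((y - h(x)) ** 2 for x, y in zip(xs, ys))
    return sq_err / (2 * sigma ** 2) + 0.5 * len(ys) * math.log(2 * math.pi * sigma ** 2)

def neg_log_likelihood_direct(ys, xs, sigma, h):
    """Same quantity, summing -ln of each Gaussian factor in the product (4.39)."""
    total = 0.0
    for x, y in zip(xs, ys):
        density = (math.exp(-(y - h(x)) ** 2 / (2 * sigma ** 2))
                   / (math.sqrt(2 * math.pi) * sigma))
        total -= math.log(density)
    return total

def h(x):
    # Hypothetical candidate predictor: a second-degree polynomial.
    return x ** 2

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 4.2, 8.8]
nats = neg_log_likelihood(ys, xs, 0.5, h)
bits = nats / math.log(2)   # convert nats to bits using ln(z) = ln 2 * log2(z)
print(round(nats, 4), round(bits, 4))
```

For fixed $\sigma$, minimizing this code length over $h$ is the same as minimizing the sum of squared errors, which is the sense in which the log-likelihood is linear in the squared error.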

With each model $\mathcal{H}^{(k)}$ we can associate a set of densities (4.39), one for each $(h, \sigma^2)$ with $h \in \mathcal{H}^{(k)}$ and $\sigma^2 \in \mathbb{R}^+$. Let $\mathcal{M}^{(k)}$ be the resulting set of conditional distributions. Each $P(\cdot \mid h, \sigma^2) \in \mathcal{M}^{(k)}$ is identified by the parameter vector $(\alpha_0, \ldots, \alpha_k, \sigma^2)$ so that $h(x) := \sum_{j=0}^{k} \alpha_j x^j$. By section 4.6.1, equation (4.32), MDL tells us to select the model minimizing

$$-\ln P(y^n \mid \mathcal{M}^{(k)}, x^n) + L(k), \tag{4.41}$$

where we may take $L(k) = 2 \log k + 1$, and $P(\cdot \mid \mathcal{M}^{(k)}, \cdot)$ is now a conditional universal model with small minimax regret. Equation (4.41) ignores the code length of $x_1, \ldots, x_n$. Intuitively, this is because we are only interested in learning how $y$ depends on $x$; therefore, we do not care how many bits are needed to encode $x$. Formally, this may be understood as follows: we really are encoding the $x$-values as well, but we do so using a fixed code that does not depend on the hypothesis $h$ under consideration. Thus, we are really trying to find the model $\mathcal{M}^{(k)}$ minimizing

$$-\ln P(y^n \mid \mathcal{M}^{(k)}, x^n) + L(k) + L'(x^n),$$

where $L'$ represents some code for $\mathcal{X}^n$. Since this code length does not involve $k$, it can be dropped from the minimization; see observation 4.5. We will not go into the precise definition of $P(y^n \mid \mathcal{M}^{(k)}, x^n)$. Ideally, it should be an NML distribution, but just as in example 4.18, this NML distribution is not well-defined. We can get reasonable alternative universal models after all, using any of the methods described in section 4.6.2; see [9] and [96] for details.


Observation  4.5  (When  the  code  length  for  xn  can  be  ignored) 

If all models under consideration represent conditional densities or probability mass functions $P(Y \mid X)$, then the code length for $X_1, \ldots, X_n$ can be ignored in model and parameter selection. Examples are applications of MDL in classification and regression.


‘Non-probabilistic’ regression and classification. In the approach we just described, we modeled the noise as being normally distributed. Alternatively, one can try to learn functions $h \in \mathcal{H}$ directly from the data, without making any probabilistic assumptions about the noise [93], [7], [125], [44] and [45]. The idea is to learn a function $h$ that leads to good predictions of future data from the same source, in the spirit of Vapnik’s [115] statistical learning theory. Here prediction quality is measured by some fixed loss function; different loss functions lead to different instantiations of the procedure. Such a version of MDL is meant to be more robust, leading to inference of a ‘good’ $h \in \mathcal{H}$ irrespective of the details of the noise distribution. This loss-based approach has also been the method of choice in applying MDL to classification problems. Here $Y$ takes on values in a finite set, and the goal is to match each feature $X$ (for example, a bit map of a handwritten digit) with its corresponding label or class (e.g., a digit). While several versions of MDL for classification have been proposed [83], [93], [64], most of these can be reduced to the same approach based on a 0/1-valued loss function [44]. In recent work [49] it is shown that this MDL approach to classification without making assumptions about the noise may behave suboptimally: there exist situations where, no matter how large $n$, MDL keeps overfitting, selecting an overly complex model with suboptimal predictive behavior. Modifications of MDL suggested by Barron [7] and Yamanishi [125] do not suffer from this defect, but they do not admit a natural coding interpretation any longer. All in all, current versions of MDL that avoid probabilistic assumptions are still in their infancy, and more research is needed to find out whether they can be modified to perform well in more general and realistic settings.

Summary. In the previous sections, we have covered basic refined MDL (section 4.5), general refined MDL (section 4.6), and several extensions of refined MDL (this section). This concludes our technical description of refined MDL. It only remains to place MDL in its proper context: what does it do compared to other methods of inductive inference? And how well does it perform, compared to other methods? The next two sections are devoted to these questions.


4.8  Relations  to  other  approaches  to  inductive  inference 

How  does  MDL  compare  to  other  model  selection  and  statistical  inference  methods? 
In  order  to  answer  this  question,  we  first  have  to  be  precise  about  what  we  mean  by 
‘MDL’;  this  is  done  in  section  4.8.1.  We  then  continue  in  section  4.8.2  by  summarizing 
MDL’s  relation  to  Bayesian  inference,  Wallace’s  MML  Principle,  Dawid’s  prequential 
model validation, cross-validation and an ‘idealized’ version of MDL based on Kolmogorov
complexity. The literature has also established connections between MDL
and Jaynes’ [57] maximum entropy principle [29], [71], [44], [46], [49] and Vapnik’s
[115] structural risk minimization principle [44]. Relations between MDL and
Akaike’s AIC [14] are subtle; they are discussed in [108].

4.8.1  What  is  MDL? 

‘MDL’ is used by different authors with somewhat different meanings. Some authors use
MDL as a broad umbrella term for all types of inductive inference based on data compression.
This would, for example, include the ‘idealized’ versions of MDL based on
Kolmogorov complexity and Wallace’s MML Principle, to be discussed below. At the
other extreme, for historical reasons, some authors use the MDL criterion to describe
a very specific (and often not very successful) model selection criterion equivalent to
BIC, discussed further below.

Here  we  adopt  the  meaning  of  the  term  that  is  embraced  in  the  survey  [9],  written  by 
arguably the three most important contributors to the field: MDL for general inference
based on universal models. These include, but are not limited to, approaches in
the spirit of observation 4.4. For example, some authors have based their inferences
on ‘expected’ rather than ‘individual sequence’ universal models [9], [73]. Moreover,
if  we  go  beyond  model  selection  (section  4.7),  then  the  ideas  of  observation  4.4  have 
to  be  modified  to  some  extent.  In  fact,  one  of  the  main  strengths  of  ‘MDL’  in  this 
broad  sense  is  that  it  can  be  applied  to  ever  more  exotic  modeling  situations,  in  which 
the  models  do  not  resemble  anything  that  is  usually  encountered  in  statistical  practice. 
An  example  is  the  model  of  context-free  grammars,  already  suggested  by  Solomonoff 
[106].  In  this  chapter,  we  call  applications  of  MDL  that  strictly  fit  into  the  scheme 
of  observation  4.4  refined  MDL  for  model/hypothesis  selection;  when  we  simply  say 
‘MDL’, we mean ‘inductive inference based on universal models’. This form of inductive
inference goes hand in hand with Rissanen’s radical MDL philosophy, which
views  learning  as  finding  useful  properties  of  the  data,  not  necessarily  related  to  the 
existence  of  a  ‘truth’  underlying  the  data.  This  view  was  outlined  in  chapter  3,  section 
3.5. Although MDL practitioners and theorists are usually sympathetic to it, the different
interpretations of MDL listed in section 4.5 make clear that MDL applications can
also  be  justified  without  adopting  such  a  radical  philosophy. 


4.8.2  MDL  and  Bayesian  inference 

Bayesian statistics [70], [10] is one of the best-known and most frequently and successfully
applied paradigms of statistical inference. It is often claimed that ‘MDL is really
just a special case of Bayes’*. Although there are close similarities, this is simply not
true. To see this quickly, consider the basic quantity in refined MDL: the NML distribution
$P_{\mathrm{nml}}$, equation (4.18). While $P_{\mathrm{nml}}$, although defined in a completely different
manner, turns out to be closely related to the Bayesian marginal likelihood, this is no
longer the case for its ‘localized’ version (4.37). There is no mention of anything like
this code/distribution in any Bayesian textbook! Consequently, it must be the case that
Bayes and MDL are somehow different.

MDL  as  a  maximum  probability  principle  For  a  more  detailed  analysis,  we  need  to 
distinguish between the two central tenets of modern Bayesian statistics: (1) Probability
distributions are used to represent uncertainty, and to serve as a basis for making
predictions, rather than standing for some imagined ‘true state of nature’. (2) All inference
and decision-making is done in terms of prior and posterior distributions. MDL
sticks  with  (1)  (although  here  the  ‘distributions’  are  primarily  interpreted  as  ‘code 
length  functions’),  but  not  (2):  MDL  allows  the  use  of  arbitrary  universal  models  such 
as  NML  and  prequential  universal  models;  the  Bayesian  universal  model  does  not  have 
a  special  status  among  these.  In  this  sense,  Bayes  offers  the  statistician  less  freedom 
in choice of implementation than MDL. In fact, MDL may be reinterpreted as a maximum
probability principle, where the maximum is relative to some given model, in
the worst-case over all sequences (Rissanen [92], [93] uses the phrase ‘global maximum
likelihood principle’). Thus, whenever the Bayesian universal model is used in
an  MDL  application,  a  prior  should  be  used  that  minimizes  worst-case  code  length 
regret, or equivalently, maximizes worst-case relative probability. There is no comparable
principle for choosing priors in Bayesian statistics, and in this respect, Bayes
offers  a  lot  more  freedom  than  MDL. 

* The reasons are probably historical: while the underlying philosophy has always been different,
most actual implementations of MDL ‘looked’ quite Bayesian until Rissanen introduced
the use of $P_{\mathrm{nml}}$.

Example 4.20

There  is  a  conceptual  problem  with  Bayes’  use  of  prior  distributions:  in  practice,  we 
very  often  want  to  use  models  which  we  a  priori  know  to  be  wrong  -  see  example  3.5. 
If  we  use  Bayes  for  such  models,  then  we  are  forced  to  put  a  prior  distribution  on  a  set 
of  distributions  which  we  know  to  be  wrong.  From  an  MDL  viewpoint,  these  priors 
are  interpreted  as  tools  to  achieve  short  code  lengths  rather  than  degrees-of-belief  and 
there  is  nothing  strange  about  the  situation;  but  from  a  Bayesian  viewpoint,  it  seems 
awkward.  To  be  sure,  Bayesian  inference  often  gives  good  results  even  if  the  model 
M  is  known  to  be  wrong;  the  point  is  that  (a)  if  one  is  a  strict  Bayesian,  one  would 
never  apply  Bayesian  inference  to  such  misspecified  M,  and  (b),  the  Bayesian  theory 
offers  no  clear  explanation  of  why  Bayesian  inference  might  still  give  good  results  for 
such  M.  MDL  provides  both  code  length  and  predictive  sequential  interpretations  of 
Bayesian  inference,  which  help  explain  why  Bayesian  inference  may  do  something 
reasonable even if M is misspecified. To be fair, we should add that there exist variations
of the Bayesian philosophy (e.g. [26]) which avoid the conceptual problem we
just described.

MDL and BIC In the first paper on MDL, Rissanen [88] used a two-part code and
showed that, asymptotically, and under regularity conditions, the two-part code length
of $x^n$ based on a $k$-parameter model $\mathcal{M}$ with an optimally discretized parameter space
is given by

$$L = -\log P(x^n \mid \hat{\theta}(x^n)) + \frac{k}{2} \log n. \tag{4.42}$$

Note that the $O(1)$-terms are ignored. However, as we have already seen, they can
be quite important. In the same year Schwarz [100] showed that, for large enough n,
Bayesian model selection between two exponential families amounts to selecting the
model minimizing (4.42), ignoring $O(1)$-terms as well. As a result of Schwarz’s paper,
model selection based on (4.42) became known as the BIC (Bayesian information
criterion). Because it does not take into account the functional form of the model $\mathcal{M}$, it often does
not work very well in practice.
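
As a minimal sketch of how (4.42) is evaluated, the following Python fragment compares a parameter-free fair-coin code with the one-parameter Bernoulli model on a binary sequence. It also computes the exact parametric complexity of the Bernoulli model, so that the size of the ignored $O(1)$ terms can be inspected next to $\frac{1}{2}\log n$. The function names and the example counts are our own illustrative assumptions, code lengths are in nats, and the complexity is computed as the logarithm of the sum of maximized likelihoods over all sequences of length n (the standard NML normalizer, assumed here rather than derived in this section).

```python
import math

def bic_code_length(neg_log_lik, k, n):
    """Asymptotic two-part code length (4.42) in nats: -log P(x^n | ML) + (k/2) log n."""
    return neg_log_lik + 0.5 * k * math.log(n)

def bernoulli_neg_log_lik(ones, n):
    """-log of the maximized Bernoulli likelihood for `ones` ones among n outcomes."""
    zeros = n - ones
    nll = 0.0
    if ones:
        nll -= ones * math.log(ones / n)
    if zeros:
        nll -= zeros * math.log(zeros / n)
    return nll

def bernoulli_nml_complexity(n):
    """Exact log of the sum, over all binary sequences of length n, of the maximized likelihood."""
    total = 0.0
    for m in range(n + 1):
        # log of the number of sequences with m ones
        log_binom = math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
        total += math.exp(log_binom - bernoulli_neg_log_lik(m, n))
    return math.log(total)

# Hypothetical data summary: 70 ones in 100 outcomes.
n, ones = 100, 70
fair = bic_code_length(n * math.log(2), k=0, n=n)                 # no free parameter
bern = bic_code_length(bernoulli_neg_log_lik(ones, n), k=1, n=n)  # one free parameter
print("fair coin:", fair, " Bernoulli:", bern)
print("exact complexity:", bernoulli_nml_complexity(n), " 0.5 log n:", 0.5 * math.log(n))
```

The gap between the two numbers printed on the last line does not vanish as n grows; it is an instance of the constant-order terms that (4.42) drops.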

It has sometimes been claimed that MDL = BIC; for example, in [14], page 286 it
is written ‘Rissanen’s result is equivalent to BIC’. This is wrong, even for the 1989
version of MDL that is referred to: as pointed out in [34], the BIC approximation
only holds if the number of parameters k is kept fixed and n goes to infinity. If we
select between nested families of models where the maximum number of parameters k
considered is either infinite or grows with n, then model selection based on both $P_{\mathrm{nml}}$
and on $P_{\mathrm{Bayes}}$ tends to select quite different models than BIC: if k gets closer to n,
the contribution to $\mathrm{COMP}_n(\mathcal{M})$ of each additional parameter becomes much smaller
than 0.5 log n [34]. However, researchers who claim MDL = BIC have a good excuse:
in early work, Rissanen himself has used the phrase ‘MDL criterion’ to refer to (4.42),
and unfortunately, the phrase has stuck.

MDL and MML MDL shares some ideas with the minimum message length (MML)
principle, which predates MDL by 10 years. Key references are [121], [122] and [123];
a long list is presented in [19]. Just as in MDL, MML chooses the hypothesis minimizing
the code-length of the data. But the codes that are used are quite different from
those in MDL. First of all, in MML one always uses two-part codes, so that MML
automatically selects both a model family and parameter values. Second, while the
MDL codes such as $P_{\mathrm{nml}}$ minimize worst-case relative code-length (regret), the two-part
codes used by MML are designed to minimize expected absolute code-length.
Here the expectation is taken over a subjective prior distribution defined on the collection
of models and parameters under consideration. While this approach contradicts
Rissanen’s philosophy, in practice it often leads to similar results.

Indeed,  Wallace  and  his  co-workers,  [121],  [122]  and  [123],  stress  that  their  approach 
is  fully  (subjective)  Bayesian.  Strictly  speaking,  a  Bayesian  should  report  his  findings 
by  citing  the  full  posterior  distribution.  But  sometimes  one  is  interested  in  a  single 
model,  or  hypothesis  for  the  data.  A  good  example  is  the  inference  of  phylogenetic 
trees in biological applications: the full posterior would consist of a mixture of several
such trees, which might all be quite different from each other. Such a mixture
is almost impossible to interpret; to gain insight into the data we need a single tree.
In  that  case,  Bayesians  often  use  the  MAP  (maximum  a  posteriori)  hypothesis  which 
maximizes  the  posterior,  or  the  posterior  mean  parameter  value.  The  first  approach 
has some unpleasant properties; for example, it is not invariant under reparameterization.
The posterior mean approach cannot be used if different model families are to be
compared  with  each  other.  The  MML  method  provides  a  theoretically  sound  way  of 
proceeding  in  such  cases. 


4.8.3  MDL,  prequential  analysis  and  cross  validation 

In a series of papers [22], [23], [24], Dawid put forward a methodology for probability
and statistics based on sequential prediction which he called the prequential
approach. When applied to model selection problems, it is closely related to MDL:
Dawid proposes to construct, for each model $\mathcal{M}^{(j)}$ under consideration, a ‘probability
forecasting system’ (a sequential prediction strategy) where the (i+1)-st outcome is
predicted based on either the Bayesian posterior $P_{\mathrm{Bayes}}(\theta \mid x^i)$ or on some estimator
$\hat{\theta}(x^i)$. Then the model is selected for which the associated sequential prediction strategy
minimizes the accumulated prediction error. Related ideas were put forward in [55]
under the name forward validation and in [90]. From section 4.5.4 we see that these
are just forms of MDL. Strictly speaking, every universal code can be thought of as
a prediction strategy, but for the Bayesian and the plug-in universal models (sections
4.5.3 and 4.5.4) the interpretation is much more natural than for others¹. Dawid mostly
talks about such ‘predictive’ universal models. On the other hand, Dawid’s framework
allows the prediction loss to be measured in terms of arbitrary loss functions,
not just the log loss. In this sense, it is more general than MDL. Finally, the prequential
idea goes beyond statistics: there is also a ‘prequential approach’ to probability theory
developed by Dawid [25] and Shafer and Vovk [101].

Note  that  the  prequential  approach  is  similar  in  spirit  to  cross-validation.  In  this  sense 
MDL  is  related  to  cross-validation  as  well.  The  main  differences  are  that  in  MDL  and 
the  prequential  approach,  (1)  all  predictions  are  done  sequentially  (the  future  is  never 
used  to  predict  the  past),  and  (2)  each  outcome  is  predicted  exactly  once. 
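
The sequential scheme can be sketched for the Bernoulli model as follows. This is an illustration under our own assumptions, not Dawid's or Rissanen's implementation: the plug-in predictor uses a smoothed frequency estimate (with `alpha = 1` this is Laplace's rule of succession), the function names are hypothetical, and code lengths are in nats. Each outcome is predicted exactly once, using only earlier outcomes. For `alpha = 1` the accumulated log loss coincides with the code length of the Bayesian marginal likelihood under a uniform prior, which shows concretely why a Bayesian universal model can be read as a prediction strategy.

```python
import math

def prequential_code_length(xs, alpha=1.0):
    """Accumulated log loss (nats) of a plug-in Bernoulli predictor.

    Outcome i+1 is predicted from x_1..x_i only, using the smoothed estimate
    (ones + alpha) / (i + 2 * alpha).
    """
    ones = 0
    total = 0.0
    for i, x in enumerate(xs):
        p_one = (ones + alpha) / (i + 2 * alpha)
        total -= math.log(p_one if x == 1 else 1.0 - p_one)
        ones += x  # the outcome is revealed only after it has been predicted
    return total

def uniform_prior_marginal_code_length(xs):
    """-log of the Bernoulli marginal likelihood under a uniform prior: m!(n-m)!/(n+1)!."""
    n, m = len(xs), sum(xs)
    return -(math.lgamma(m + 1) + math.lgamma(n - m + 1) - math.lgamma(n + 2))

xs = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical data
print(prequential_code_length(xs), uniform_prior_marginal_code_length(xs))
```

Replacing the logarithmic loss inside the loop by another loss function gives the more general prequential setting mentioned above, at the price of losing the code-length interpretation.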

1 The reason is that Bayesian and plug-in models can be interpreted as probabilistic sources.
The NML and the two-part code models are not probabilistic sources, since $P^{(n)}$ and $P^{(n+1)}$
are not compatible in the sense of section 4.1.

4.8.4 Kolmogorov complexity and structure function: ideal MDL

Kolmogorov complexity [71] has played a large but mostly inspirational role in Rissanen’s
development of MDL. Over the last fifteen years, several ‘idealized’ versions of
MDL have been proposed, which are more directly based on Kolmogorov complexity
theory [6], [8], [71], [116]. These are all based on two-part codes, where hypotheses
are described using a universal programming language such as C or Pascal. For
example, in one proposal [8], given data D one picks the distribution minimizing

$$K(P) + [-\log P(D)], \tag{4.43}$$

where the minimum is taken over all computable probability distributions, and $K(P)$
is the length of the shortest computer program that, when input (x, d), outputs P(x) to
d bits precision. While such a procedure is mathematically well-defined, it cannot be
used in practice. The reason is that in general, the P minimizing (4.43) cannot be effectively
computed. Kolmogorov himself used a variation of (4.43) in which one adopts,
among all P with $K(P) - \log P(D) \approx K(D)$, the P with smallest $K(P)$. Here $K(D)$
is the Kolmogorov complexity of D, that is, the length of the shortest computer program
that prints D and then halts. This approach is known as the Kolmogorov structure
function  or  minimum  sufficient  statistic  approach  [120].  In  this  approach,  the  idea  of 
separating  data  and  noise  (section  4.5.1)  is  taken  as  basic,  and  the  hypothesis  selection 
procedure  is  defined  in  terms  of  it.  The  selected  hypothesis  may  now  be  viewed  as 
capturing  all  structure  inherent  in  the  data  -  given  the  hypothesis,  the  data  cannot  be 
distinguished  from  random  noise.  Therefore,  it  may  be  taken  as  a  basis  for  lossy  data 
compression  -  rather  than  sending  the  whole  sequence,  one  only  sends  the  hypothesis 
representing  the  ‘structure’  in  the  data.  The  receiver  can  then  use  this  hypothesis  to 
generate  ‘typical’  data  for  it  -  this  data  should  then  ‘look  just  the  same’  as  the  original 
data  D.  Rissanen  views  this  separation  idea  as  perhaps  the  most  fundamental  aspect 
of  ‘learning  by  compression’.  Therefore,  in  recent  work  he  has  tried  to  relate  MDL 
(as defined here, based on lossless compression) to the Kolmogorov structure function
(thereby connecting it to lossy compression) and, as he puts it, ‘opening up a new
chapter  in  the  MDL  theory’  [116],  [120],  [99]. 
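
Although $K(D)$ itself cannot be computed, any concrete lossless compressor yields a computable upper bound on it, up to an additive constant for the decompressor. The sketch below uses Python's `zlib` purely as a stand-in for such a bound; this substitution is our assumption and is not one of the idealized MDL proposals cited above. It illustrates the point about structure versus noise: data with regular structure admit a short description, while data that look like random noise do not.

```python
import random
import zlib

def compressed_length_bits(data: bytes) -> int:
    """Length in bits of the zlib-compressed data: a crude, computable
    upper bound (up to a constant) on the Kolmogorov complexity of `data`."""
    return 8 * len(zlib.compress(data, 9))

regular = b"01" * 5000                      # highly regular sequence
rng = random.Random(0)                      # fixed seed for reproducibility
noisy = bytes(rng.getrandbits(8) for _ in range(10000))  # pseudo-random bytes

print("regular:", compressed_length_bits(regular), "bits")
print("noisy:  ", compressed_length_bits(noisy), "bits")
```

Note that the pseudo-random bytes actually have a short description (the generator plus its seed), which a general-purpose compressor cannot find; this is a small-scale reminder of why the minimizer of (4.43) cannot be effectively computed.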

Summary We have shown that MDL is closely related to, yet distinct from, several
other methods for inductive inference. In the next section we discuss how well it performs
compared to such other methods.


4.9  Problems  for  MDL? 

Some  authors  have  criticized  MDL  either  on  conceptual  grounds  (the  idea  makes  no 
sense)  [124],  [27]  or  on  practical  grounds  (sometimes  it  does  not  work  very  well  in 
practice) [64], [82]. Are these criticisms justified? Let us consider them in turn.

4.9.1  Conceptual  problems:  Occam’s  razor 

The  most-often  heard  conceptual  criticisms  are  invariably  related  to  Occam’s  razor.  We 
have already discussed in section 3.5 of the previous chapter why we regard these criticisms
as being entirely mistaken. Based on our newly acquired technical knowledge
of MDL, let us discuss these criticisms a little bit further:

1. ‘Occam’s razor (and MDL) is arbitrary’ If we restrict ourselves to refined MDL
for comparing a finite number of models for which the NML distribution is well-defined,
then there is nothing arbitrary about MDL: it is exactly clear what codes
we should use for our inferences. The NML distribution and its close cousins, the Jeffreys’
prior marginal likelihood $P_{\mathrm{Bayes}}$ and the asymptotic expansion (4.21), are all
invariant to continuous 1-to-1 reparameterizations of the model: parameterizing our
model  in  a  different  way  (choosing  a  different  ‘description  language’)  does  not  change 
the  inferred  description  lengths. 

If  we  go  beyond  models  for  which  the  NML  distribution  is  defined,  and/or  we  compare 
an  infinite  set  of  models  at  the  same  time,  then  some  ‘subjectivity’  is  introduced  -  while 
there  are  still  tough  restrictions  on  the  codes  that  we  are  allowed  to  use,  all  such  codes 
prefer  some  hypotheses  in  the  model  over  others.  If  one  does  not  have  an  a  priori 
preference  over  any  of  the  hypotheses,  one  may  interpret  this  as  some  arbitrariness 
being  added  to  the  procedure.  But  this  ‘arbitrariness’  is  of  an  infinitely  milder  sort  than 
the  arbitrariness  that  can  be  introduced  if  we  allow  completely  arbitrary  codes  for  the 
encoding  of  hypotheses  as  in  crude  two-part  code  MDL,  section  4.3. 

Things get more subtle if we are interested not only in model selection
(find the best order Markov chain for the data) but also in infinite-dimensional
estimation (find the best Markov chain parameters for the
data, among the set $B$ of all Markov chains of each order). In the latter
case, if we are to apply MDL, we somehow have to carve up $B$ into
subsets $\mathcal{M}^{(0)} \subseteq \mathcal{M}^{(1)} \subseteq \cdots \subseteq B$. Suppose that we have already chosen
$\mathcal{M}^{(1)} = B^{(1)}$ as the set of 1-st order Markov chains. We normally
take $\mathcal{M}^{(0)} = B^{(0)}$, the set of 0-th order Markov chains (Bernoulli
distributions). But we could also have defined $\mathcal{M}^{(0)}$ as the set of all
1-st order Markov chains with $P(X_{i+1} = 1 \mid X_i = 1) = P(X_{i+1} = 0 \mid X_i = 0)$.
This defines a one-dimensional subset of $B^{(1)}$ that is not
equal to $B^{(0)}$. While there are several good reasons* for choosing $B^{(0)}$
rather than $\mathcal{M}^{(0)}$, there may be no indication that $B^{(0)}$ is somehow a
priori more likely than $\mathcal{M}^{(0)}$. While MDL tells us that we somehow
have to carve up the full set $B$, it does not give us precise guidelines
on how to do this: different carvings may be equally justified and
lead to different inferences for small samples. In this sense, there is
indeed some form of arbitrariness in this type of MDL application.

But this is unavoidable: we stress that this type of arbitrariness is enforced
by all combined model/parameter selection methods, whether
they be of the structural risk minimization type [115], AIC-type [14],
cross-validation or any other type. The only alternative is treating all
hypotheses in the huge class $B$ on the same footing, which amounts to
maximum likelihood estimation and extreme overfitting.
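
The two alternative one-parameter carvings discussed above can be compared numerically. The sketch below is our own illustration with hypothetical function names and toy sequences: it codes the symbols after the first one either with the Bernoulli model or with the ‘symmetric’ first-order chain in which the probability of repeating the previous symbol is the same in both states, and reports the maximized-likelihood code length in nats. Both submodels have one free parameter, so a complexity penalty of the form $\frac{k}{2}\log n$ would treat them identically; which one fits better depends entirely on the data.

```python
import math

def _nll(counts):
    """-log maximized likelihood (nats) of a multinomial with the given counts."""
    total = sum(counts)
    return -sum(c * math.log(c / total) for c in counts if c > 0)

def bernoulli_nll(xs):
    """Code symbols 2..n as i.i.d. outcomes, at the ML probability of a one."""
    tail = xs[1:]
    ones = sum(tail)
    return _nll([ones, len(tail) - ones])

def symmetric_chain_nll(xs):
    """Code symbols 2..n with a first-order chain with P(1|1) = P(0|0) = s, at the ML value of s."""
    stays = sum(1 for a, b in zip(xs, xs[1:]) if a == b)
    return _nll([stays, len(xs) - 1 - stays])

alternating = [0, 1] * 50        # 0101...: perfectly predictable from the previous symbol
mostly_ones = [1, 1, 1, 0] * 25  # biased marginal frequency of ones

for name, xs in [("alternating", alternating), ("mostly ones", mostly_ones)]:
    print(name, "Bernoulli:", bernoulli_nll(xs), "symmetric chain:", symmetric_chain_nll(xs))
```

On the alternating sequence the symmetric chain gives the shorter code, on the biased sequence the Bernoulli model does, so the choice of the one-dimensional submodel can change the inference on small samples.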

2.  ‘Occam’s  razor  is  false’  We  often  try  to  model  real-world  situations  that  can  be 
arbitrarily  complex,  so  why  should  we  favor  simple  models?  We  gave  an  informal 
answer  in  chapter  3  where  we  claimed  that  even  if  the  true  data  generating  machinery 
is  very  complex,  it  may  be  a  good  strategy  to  prefer  simple  models  for  small  sample 
sizes. 


* For example, $B^{(0)}$ is better interpretable.

We are now in a position to give one formalization of this informal claim: it is simply
the fact that MDL procedures, with their built-in preference for ‘simple’ models with
small parametric complexity, are typically statistically consistent, achieving good rates
of  convergence,  whereas  methods  such  as  maximum  likelihood  -  which  do  not  take 
model  complexity  into  account  -  are  typically  inconsistent  whenever  they  are  applied 
to  complex  enough  models  such  as  the  set  of  polynomials  of  each  degree  or  the  set 
of  Markov  chains  of  all  orders.  This  has  implications  for  the  quality  of  predictions 
based  on  complex  enough  models,  no  matter  how  many  training  data  we  observe.  If  we 
use  the  maximum  likelihood  distribution  to  predict  future  data  from  the  same  source, 
the  prediction  error  we  make  will  not  converge  to  the  prediction  error  that  could  be 
obtained  if  the  true  distribution  were  known;  if  we  use  an  MDL  submodel/parameter 
estimate  (section  4.7),  the  prediction  error  will  converge  to  this  optimal  achievable 
error. 
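
The contrast between maximum likelihood and a complexity-penalized choice can be sketched for binary Markov chains of orders 0 to 4. This is an illustration under stated assumptions: the penalty used is the crude $\frac{k}{2}\log n$ term of (4.42) with $k = 2^{\text{order}}$ free parameters, not a refined MDL code, and the function names and the simulated source are hypothetical. All orders are scored on the same symbols, so the models are nested and the maximized likelihood can never get worse when the order increases.

```python
import math
import random
from collections import Counter

def markov_nll(xs, order, start):
    """-log maximized likelihood (nats) of xs[start:] under a binary Markov
    chain of the given order; requires start >= order."""
    pair_counts = Counter()
    for i in range(start, len(xs)):
        pair_counts[(tuple(xs[i - order:i]), xs[i])] += 1
    context_totals = Counter()
    for (context, _), c in pair_counts.items():
        context_totals[context] += c
    return -sum(c * math.log(c / context_totals[context])
                for (context, _), c in pair_counts.items())

def order_scores(xs, max_order, penalized):
    """Score every order 0..max_order on the same symbols xs[max_order:]."""
    n = len(xs) - max_order
    scores = []
    for k in range(max_order + 1):
        score = markov_nll(xs, k, max_order)
        if penalized:
            score += 0.5 * (2 ** k) * math.log(n)  # 2^k free parameters
        scores.append(score)
    return scores

def best_order(scores):
    return min(range(len(scores)), key=scores.__getitem__)

# Hypothetical source: a first-order chain repeating the previous symbol with probability 0.9.
rng = random.Random(1)
xs = [0]
for _ in range(4999):
    xs.append(xs[-1] if rng.random() < 0.9 else 1 - xs[-1])

print("ML choice:       ", best_order(order_scores(xs, 4, penalized=False)))
print("penalized choice:", best_order(order_scores(xs, 4, penalized=True)))
```

Because the unpenalized scores are non-increasing in the order, maximum likelihood has no reason ever to stop at a low order, which is the overfitting behavior described above.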

Of  course,  consistency  is  not  the  only  desirable  property  of  a  learning  method,  and  it 
may be that in some particular settings, and under some particular performance measures,
some alternatives to MDL outperform MDL. Indeed this can happen; see below.
Yet  it  remains  the  case  that  all  methods  we  know  of  that  successfully  deal  with  models 
of  arbitrary  complexity  have  a  built-in  preference  for  selecting  simpler  models  at  small 
sample  sizes  -  methods  such  as  Vapnik’s  [115]  structural  risk  minimization,  penalized 
minimum error estimators [7] and the Akaike criterion [14] all trade off complexity
against error on the data, the result invariably being that in this way, good convergence
properties can be obtained. While these approaches measure ‘complexity’ in a manner
different from MDL, and attach different relative weights to error on the data and complexity,
the fundamental idea of finding a trade-off between ‘error’ and ‘complexity’
remains. 


4.9.2  Practical  problems  with  MDL 

We just described some perceived problems with MDL. Unfortunately, there are also
some real ones: MDL is not a perfect method. While in many cases the methods described
here perform very well¹, there are also cases where they perform suboptimally
compared to other state-of-the-art methods. Often this is due to one of two reasons:

1.  An  asymptotic  formula  like  (4.21)  was  used  and  the  sample  size  was  not  large 
enough  to  justify  this  [81]. 

2.  $P_{\mathrm{nml}}$ was undefined for the models under consideration, and this was solved
by cutting off the parameter ranges at ad hoc values [69].

In these cases the problem probably lies in the use of invalid approximations rather
than with the MDL idea itself. More research is needed to find out when the asymptotics
and other approximations can be trusted, and what is the ‘best’ way to deal
with undefined $P_{\mathrm{nml}}$. For the time being, we suggest avoiding (4.21) whenever
possible, and never cutting off the parameter ranges at arbitrary values. Instead, if
$\mathrm{COMP}_n(\mathcal{M})$ becomes infinite, then some of the methods described in section 4.6.2
should be used. Given these restrictions, $P_{\mathrm{nml}}$ and Bayesian inference with Jeffreys’
prior are the preferred methods, since they both achieve the minimax regret. If they are
either ill-defined or computationally prohibitive for the models under consideration,
one can use a prequential method or a sophisticated two-part code such as described
by Barron and Cover [8].

1 We mention [51], [52] reporting excellent behavior of MDL in regression contexts; and
[2], [67], [79] reporting excellent behavior of predictive (prequential) coding in Bayesian network
model selection and regression. Also, ‘objective Bayesian’ model selection methods are
frequently and successfully used in practice [63]. Since these are based on non-informative
priors such as Jeffreys’, they often coincide with a version of refined MDL and thus indicate
successful performance of MDL.

MDL and misspecification However, there is a class of problems where MDL is problematic
in a more fundamental sense. Namely, if none of the distributions under consideration
represents the data generating machinery very well, then both MDL and
Bayesian inference may sometimes do a bad job in finding the ‘best’ approximation
within this class of not-so-good hypotheses. This has been observed in practice* [64],
[17], [82]. In [49] it is shown that MDL can behave quite unreasonably for some classification
problems in which the true distribution is not in $\mathcal{M}$. This is closely related
to the problematic behavior of MDL for classification tasks as mentioned in section
4.7. All this is a bit ironic, since MDL was explicitly designed not to depend on the
untenable assumption that some $P^* \in \mathcal{M}$ generates the data. But empirically we find
that while it generally works quite well if some $P^* \in \mathcal{M}$ generates the data, it may
sometimes fail if this is not the case.


4.10  Discussion 

MDL  is  a  versatile  method  for  inductive  inference:  it  can  be  interpreted  in  at  least  four 
different  ways,  all  of  which  indicate  that  it  does  something  reasonable.  It  is  typically 
asymptotically  consistent,  achieving  good  rates  of  convergence.  It  achieves  all  this 
without  having  been  designed  for  consistency,  being  based  on  a  philosophy  which 
makes  no  metaphysical  assumptions  about  the  existence  of  ‘true’  distributions.  These 
facts  strongly  suggest  that  it  is  a  good  method  to  use  in  practice.  Practical  evidence 
confirms this in many contexts; in other contexts its behavior can be problematic. The
main challenge for the future is to improve MDL for such cases, by somehow extending
and further refining MDL procedures in a non-ad-hoc manner. There is confidence
that this can be done, and that MDL will continue to play an important role in the
development of statistical and, more generally, inductive inference.

Further reading Good places to start further exploration of MDL are [7] and [52].
Both papers provide excellent introductions, but they are geared towards a more specialized
audience of information theorists and statisticians, respectively. Also worth
reading is Rissanen’s [93] monograph. While outdated as an introduction to MDL
methods, this famous ‘little green book’ still serves as a great introduction to Rissanen’s
radical but appealing philosophy, which is described very eloquently.


*  However,  see  [117]  where  it  is  pointed  out  that  the  problem  of  [64]  disappears  if  a  more 
reasonable  coding  scheme  is  used. 

5.  Model  uncertainty


We  consider  the  problems  of  variable  selection  and  accounting  for  model  uncertainty 
in linear models. Conditioning on a single selected model ignores model uncertainty
and thus leads to underestimation of uncertainty when making inferences about quantities
of interest. The complete Bayesian solution to this problem involves averaging
over all possible models when making inferences. This approach is often not practical.
In this chapter we offer two alternative approaches. First, we describe a Bayesian
model  selection  algorithm  called  ‘Occam’s  window’  which  involves  averaging  over  a 
reduced  set  of  models.  Second,  we  describe  a  Markov  chain  Monte  Carlo  approach 
which  directly  approximates  the  exact  solution. 

The selection of subsets of predictor variables is a basic part of building a linear model.
The objective of variable selection is typically stated as follows: given a dependent
variable $Y$ and a set of candidate predictors $X_1, \ldots, X_k$, find the best model of
the form

$$Y = \beta_0 + \sum_{j=1}^{p} \beta_{i_j} X_{i_j} + \varepsilon,$$

where $X_{i_1}, X_{i_2}, \ldots, X_{i_p}$ is a subset of $X_1, X_2, \ldots, X_k$.

In  this  chapter  we  embed  this  model  selection  problem  in  the  larger  framework  of 
accounting  for  model  uncertainty.  We  argue  that  conditioning  on  a  single  selected 
model  ignores  model  uncertainty,  and  that  this,  in  turn,  leads  to  underestimation  of 
uncertainty when making inferences about quantities of interest. A complete Bayesian
solution to this problem involves averaging over all possible models when making inferences
about quantities of interest. Indeed, this approach provides optimal predictive
ability [75]. In many applications, however, this averaging will not be a practical proposition,
and here we present two alternative approaches. First, we extend the Bayesian
model  selection  algorithm  [75]  to  linear  regression  models.  We  refer  to  this  algorithm 
as  Occam’s  window.  Appealing  to  scientific  norms,  this  approach  involves  averaging 
over  a  reduced  set  of  models  and  allows  for  effective  communication  to  the  analyst. 
Second,  we  directly  approximate  the  complete  solution  by  applying  the  Markov  chain 
Monte  Carlo  approach  [76]  to  linear  regression  models.  In  this  approach  the  posterior 
distribution  of  a  quantity  of  interest  is  approximated  by  a  Markov  chain  Monte  Carlo 
method  which  generates  a  process  that  moves  through  model  space. 


5.1  Accounting  for  model  uncertainty 

A  typical  approach  to  data  analysis  is  to  carry  out  a  model  selection  exercise  leading  to 
a  single  ‘best’  model  and  then  to  make  inference  as  if  the  selected  model  were  the  true 
model.  However,  this  ignores  a  major  component  of  uncertainty,  namely  uncertainty 
about  the  model  itself  [28].  As  a  consequence,  uncertainty  about  quantities  of  interest 
can be underestimated. For a striking example of this, see [76].

There is a standard Bayesian solution to this problem. If $\mathcal{M} = \{M_1, \ldots, M_K\}$ denotes
the set of all models being considered and if $\Delta$ is the quantity of interest, such as a
future observation or the utility of a course of action, then the posterior distribution of
$\Delta$ given the data $D$ is

$$\mathrm{pr}(\Delta \mid D) = \sum_{k=1}^{K} \mathrm{pr}(\Delta \mid M_k, D)\,\mathrm{pr}(M_k \mid D). \qquad (5.1)$$


This is an average of the posterior distributions under each model, weighted by the
corresponding posterior model probabilities. In (5.1), the posterior probability of model
$M_k$ is given by

$$\mathrm{pr}(M_k \mid D) = \frac{\mathrm{pr}(D \mid M_k)\,\mathrm{pr}(M_k)}{\sum_{l=1}^{K} \mathrm{pr}(D \mid M_l)\,\mathrm{pr}(M_l)}, \qquad (5.2)$$


where

$$\mathrm{pr}(D \mid M_k) = \int \mathrm{pr}(D \mid \theta_k, M_k)\,\mathrm{pr}(\theta_k \mid M_k)\,d\theta_k \qquad (5.3)$$

is the marginal likelihood of model $M_k$, $\theta_k$ is the vector of parameters of model $M_k$,
$\mathrm{pr}(\theta_k \mid M_k)$ is the prior density of $\theta_k$ under model $M_k$, $\mathrm{pr}(D \mid \theta_k, M_k)$ is the likelihood,
and $\mathrm{pr}(M_k)$ is the prior probability that $M_k$ is the true model. All probabilities are
implicitly conditional on $\mathcal{M}$, the set of all models being considered.


Implementation of the above strategy is difficult for two reasons. First, the integral in
(5.3)  can  be  hard  to  compute.  Second,  the  number  of  terms  in  (5.1)  can  be  enormous. 
In  what  follows  we  present  workable  solutions  to  both  of  these  problems. 
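As an illustration only, the following Python sketch shows how the weights in (5.2) and a model-averaged posterior mean based on (5.1) could be computed once the marginal likelihoods are available. The function names and the numerical values are hypothetical; the log-sum-exp step is a standard numerical device that we assume here and is not part of the original description.

```python
import math

def posterior_model_probs(log_marginals, log_priors):
    """Equation (5.2) evaluated on the log scale (log-sum-exp for stability)."""
    log_joint = [lm + lp for lm, lp in zip(log_marginals, log_priors)]
    peak = max(log_joint)
    weights = [math.exp(v - peak) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

def model_averaged_mean(post_probs, means_by_model):
    """Posterior mean of the quantity of interest implied by the mixture (5.1)."""
    return sum(p * m for p, m in zip(post_probs, means_by_model))

# Hypothetical inputs: three models with equal prior probabilities.
log_marginals = [-1050.0, -1052.0, -1060.0]   # log pr(D | M_k), assumed values
log_priors = [math.log(1.0 / 3.0)] * 3        # log pr(M_k)
probs = posterior_model_probs(log_marginals, log_priors)
averaged = model_averaged_mean(probs, [2.0, 3.0, 10.0])
```

Working with logarithms matters in practice because marginal likelihoods of realistic data sets are usually far too small to be represented directly as floating point numbers.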


5.2  Bayesian  framework  and  selection  of  prior  distributions 

Each model we consider is of the form

$$Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon = X\beta + \varepsilon,$$

where the observed data on the predictors are contained in the $n \times (p+1)$ matrix $X$
and the observed data on the dependent variable are contained in the $n$-vector $Y$. We
assign to $\varepsilon$ a normal distribution with zero mean and variance $\sigma^2$ and assume that the
$\varepsilon$'s in distinct cases are independent. We consider the $(p+1)$ parameters $\beta$ and $\sigma^2$ to
be unknown.

Where possible, informative prior distributions for $\beta$ and $\sigma^2$ should be elicited and
incorporated into the analysis [38]. In the absence of expert opinion we seek to choose
prior distributions which reflect uncertainty about the parameters and also embody
reasonable a priori constraints. We use the standard normal-gamma conjugate class
of priors

$$\beta \sim N(\mu, \sigma^2 V), \qquad \frac{\nu\lambda}{\sigma^2} \sim \chi^2_{\nu}.$$

Here $\nu$, $\lambda$, the $(p+1) \times (p+1)$ matrix $V$ and the $(p+1)$-vector $\mu$ are hyperparameters
to be chosen.

For non-categorical predictor variables we assume the individual $\beta$'s to be independent
a priori. We center the distribution of $\beta$ on zero (apart from $\beta_0$) and choose
$\mu = (\hat\beta_0, 0, 0, \ldots, 0)$ where $\hat\beta_0$ is the ordinary least squares estimate of $\beta_0$. The covariance
matrix $V$ is diagonal with entries $(s_Y^2, \phi^2 s_1^{-2}, \phi^2 s_2^{-2}, \ldots, \phi^2 s_p^{-2})$ where $s_Y^2$ denotes the
sample variance of $Y$, $s_i^2$ denotes the sample variance of $X_i$ for $i = 1, \ldots, p$, and $\phi$ is
a hyperparameter to be chosen. The prior variance of $\beta_0$ is chosen conservatively and
represents an upper bound on the reasonable variance of this parameter. The variances
of the remaining $\beta$-parameters are chosen to reflect increasing precision about each $\beta_i$
as the variance of the corresponding $X_i$ increases and to be invariant to scale changes
in both the predictor variables and the response variable.


For a categorical predictor variable $X_i$ with $(c+1)$ possible outcomes ($c \geq 2$), the
Bayes factors should be invariant to the selection of the corresponding dummy variables
$(X_{i1}, \ldots, X_{ic})$. To this end we set the prior variance of $(\beta_{i1}, \ldots, \beta_{ic})$ equal to
$\sigma^2 \phi^2 \left(\tfrac{1}{n} X^{iT} X^{i}\right)^{-1}$, where $X^i$ is the design matrix for the dummy variables, where
each dummy variable has been centered by subtracting its sample mean. This is related
to the $g$-prior in [126] and the complete prior covariance matrix for $\beta$ is now given by

$$V(\beta) = \sigma^2 \begin{pmatrix}
s_Y^2 & & & & & \\
 & \phi^2 s_1^{-2} & & & & \\
 & & \ddots & & & \\
 & & & \phi^2 \left(\tfrac{1}{n} X^{iT} X^{i}\right)^{-1} & & \\
 & & & & \ddots & \\
 & & & & & \phi^2 s_p^{-2}
\end{pmatrix}.$$


To choose the remaining hyperparameters $\nu$, $\lambda$ and $\phi$ we define a number of reasonable
desiderata and attempt to satisfy them. In what follows we assume that all the variables
have been standardized to have zero mean and sample variance one. We would like

• The prior density $p(\beta_1, \ldots, \beta_p)$ to be reasonably flat over the unit hypercube
  $[-1, 1]^p$.

• $p(\sigma^2)$ to be reasonably flat over $(a, 1)$ for some small $a$.

• $\mathrm{pr}(\sigma^2 < 1)$ to be large.

The order of importance of these desiderata is roughly the order in which they are
listed. More formally, we maximize $\mathrm{pr}(\sigma^2 < 1)$ subject to

$$\frac{\mathrm{pr}(\beta_1 = 0, \ldots, \beta_p = 0)}{\mathrm{pr}(\beta_1 = 1, \ldots, \beta_p = 1)} \leq K_1^2,$$

where, following [59], we choose $K_1 = \sqrt{10}$, and

$$\frac{\max_{a \leq \sigma^2 \leq 1} \mathrm{pr}(\sigma^2)}{\mathrm{pr}(\sigma^2 = a)} \leq K_2, \qquad
\frac{\max_{a \leq \sigma^2 \leq 1} \mathrm{pr}(\sigma^2)}{\mathrm{pr}(\sigma^2 = 1)} \leq K_2.$$

Since desideratum 2 is less important than desideratum 1, we have chosen $K_2 = 10$.

For $a = 0.05$ this yields $\nu = 2.58$, $\lambda = 0.28$, and $\phi = 2.85$. For this set of
hyperparameters $\mathrm{pr}(\sigma^2 < 1) = 0.81$.
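The quoted probability can be checked numerically. The sketch below is an illustration under the assumption that the prior on the variance is $\nu\lambda/\sigma^2 \sim \chi^2_\nu$, so that $\sigma^2 < 1$ is equivalent to $\chi^2_\nu > \nu\lambda$. The helper names are hypothetical, and the regularized incomplete gamma function is evaluated with its standard power series rather than a library routine.

```python
import math

def regularized_lower_gamma(a, x, terms=200):
    """P(a, x) = gamma(a, x) / Gamma(a), via the standard power series in x."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, terms):
        term *= x / (a + n)          # x^n / (a (a+1) ... (a+n))
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def prob_sigma2_below_one(nu, lam):
    """pr(sigma^2 < 1) when nu * lam / sigma^2 is chi-square with nu d.o.f."""
    # sigma^2 < 1  <=>  chi2_nu > nu * lam, and chi2_nu / 2 ~ Gamma(nu / 2, 1).
    return 1.0 - regularized_lower_gamma(nu / 2.0, nu * lam / 2.0)

p = prob_sigma2_below_one(nu=2.58, lam=0.28)
```

With $\nu = 2.58$ and $\lambda = 0.28$ the value of `p` is approximately 0.81, in agreement with the figure given in the text.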

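The marginal likelihood for $Y$ under a model $M_i$ based on the proper priors described
above is given by

$$p(Y \mid \mu_i, V_i, X_i, M_i) = \frac{\Gamma\!\left(\frac{\nu+n}{2}\right)(\nu\lambda)^{\nu/2}}{\pi^{n/2}\,\Gamma\!\left(\frac{\nu}{2}\right)\,|I + X_i V_i X_i^T|^{1/2}}
\times \left(\lambda\nu + (Y - X_i\mu_i)^T (I + X_i V_i X_i^T)^{-1} (Y - X_i\mu_i)\right)^{-\frac{\nu+n}{2}}, \qquad (5.4)$$

where $X_i$ is the design matrix and $V_i$ is the covariance matrix for $\beta$ corresponding to
model $M_i$ [85]. The Bayes factor for $M_0$ versus $M_1$, the ratio of (5.4) for $i = 0$ and $i = 1$,
is then given by

$$B_{01} = \left(\frac{|I + X_1 V_1 X_1^T|}{|I + X_0 V_0 X_0^T|}\right)^{1/2}
\times \left(\frac{\lambda\nu + (Y - X_0\mu_0)^T (I + X_0 V_0 X_0^T)^{-1} (Y - X_0\mu_0)}{\lambda\nu + (Y - X_1\mu_1)^T (I + X_1 V_1 X_1^T)^{-1} (Y - X_1\mu_1)}\right)^{-\frac{\nu+n}{2}}. \qquad (5.5)$$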
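A minimal sketch of how (5.4) and (5.5) might be evaluated on the log scale is given below. It is an illustration, not the implementation used for the results in this chapter: the function names are hypothetical, matrices are plain nested lists, and we assume a Cholesky factorization of $I + X_i V_i X_i^T$ to obtain both the determinant and the quadratic form.

```python
import math

def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def log_marginal_likelihood(y, x, mu, v, nu, lam):
    """Log of (5.4); x is n-by-q, mu has length q, v is q-by-q."""
    n, q = len(y), len(mu)
    # S = I + X V X^T
    xv = [[sum(x[i][k] * v[k][j] for k in range(q)) for j in range(q)]
          for i in range(n)]
    s = [[(1.0 if i == j else 0.0) + sum(xv[i][k] * x[j][k] for k in range(q))
          for j in range(n)] for i in range(n)]
    low = cholesky(s)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    # Residual r = Y - X mu; solving L z = r gives r^T S^{-1} r = z^T z.
    r = [y[i] - sum(x[i][k] * mu[k] for k in range(q)) for i in range(n)]
    z = []
    for i in range(n):
        z.append((r[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    quad = sum(val * val for val in z)
    return (math.lgamma((nu + n) / 2.0) + 0.5 * nu * math.log(nu * lam)
            - 0.5 * n * math.log(math.pi) - math.lgamma(nu / 2.0)
            - 0.5 * log_det - 0.5 * (nu + n) * math.log(lam * nu + quad))

def log_bayes_factor(y, x0, mu0, v0, x1, mu1, v1, nu, lam):
    """Log of B_01 in (5.5); positive values favour M_0."""
    return (log_marginal_likelihood(y, x0, mu0, v0, nu, lam)
            - log_marginal_likelihood(y, x1, mu1, v1, nu, lam))
```

For example, with a response that is an exact multiple of a single centred predictor, `log_bayes_factor` comparing the intercept-only model with the model that includes the predictor is strongly negative, that is, the data favour the larger model.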


5.3 Model selection using Occam’s window

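Our first way of accounting for model uncertainty starting from (5.1) involves applying
the Occam’s window algorithm [75] to linear regression models. Two basic principles
underlie this approach. First, if a model predicts the data far less well than the model
which provides the best predictions, then it has effectively been discredited and should
no longer be considered. Thus models not belonging to

$$\mathcal{A}' = \left\{ M_k : \frac{\max_l \mathrm{pr}(M_l \mid D)}{\mathrm{pr}(M_k \mid D)} \leq C \right\} \qquad (5.6)$$

should be excluded from (5.1), where $C$ is chosen by the data analyst. Second, appealing
to Occam’s razor, we exclude models which receive less support from the data
than any of their simpler submodels. More formally, we also exclude from (5.1) models
belonging to

$$\mathcal{B} = \left\{ M_k : \exists\, M_l \in \mathcal{A}',\; M_l \subset M_k,\; \frac{\mathrm{pr}(M_l \mid D)}{\mathrm{pr}(M_k \mid D)} > 1 \right\}. \qquad (5.7)$$

Equation (5.1) for the quantity of interest $\Delta$ is then replaced by

$$\mathrm{pr}(\Delta \mid D) = \frac{\sum_{M_k \in \mathcal{A}} \mathrm{pr}(\Delta \mid M_k, D)\,\mathrm{pr}(D \mid M_k)\,\mathrm{pr}(M_k)}{\sum_{M_k \in \mathcal{A}} \mathrm{pr}(D \mid M_k)\,\mathrm{pr}(M_k)}, \qquad (5.8)$$

where

$$\mathcal{A} = \mathcal{A}' \setminus \mathcal{B}. \qquad (5.9)$$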

This greatly reduces the number of models in the sum in (5.1) and now all that is
required is a search strategy. First, if a model is rejected then all its submodels are
rejected. The second principle - ‘Occam’s window’ - concerns the interpretation of
the ratio of posterior model probabilities $\mathrm{pr}(M_1 \mid D)/\mathrm{pr}(M_0 \mid D)$. Here model $M_0$ is
a model with one predictor fewer than $M_1$. The essential idea is shown in figure 5.1.
If there is evidence for $M_0$ then $M_1$ is rejected, but to reject $M_0$ we require strong
evidence for the larger model, $M_1$. If the evidence is inconclusive (falling in Occam’s
window) neither model is rejected.


Figure 5.1: Occam's window: interpreting the posterior odds for nested models.


These principles define the strategy. Typically, the number of terms in (5.1) is reduced
to fewer than 25, and often to as few as one or two. A description of the algorithm is
given below.

The  search  can  proceed  in  two  directions:  ‘Up’  from  each  starting  model  by  adding 
variables,  or  ‘Down’  from  each  starting  model  by  dropping  variables.  When  starting 
from  a  model  made  up  of  some  subset  of  variables,  we  first  execute  the  ‘Down’  algo¬ 
rithm.  Then  we  execute  the  ‘Up’  algorithm,  using  models  from  the  ‘Down’  algorithm 
as  a  starting  point.  Experience  to  date  suggests  that  the  ordering  of  these  operations 
has little impact on the final set of models. Let $\mathcal{A}$ and $\mathcal{C}$ be subsets of model space
$\mathcal{M}$, where $\mathcal{A}$ denotes the set of ‘acceptable’ models and $\mathcal{C}$ denotes the models under
consideration. For both algorithms we begin with $\mathcal{A} = \emptyset$ and $\mathcal{C}$ = set of starting
models.


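• Down algorithm

  1. Select a model $M$ from $\mathcal{C}$

  2. $\mathcal{C} \leftarrow \mathcal{C} - \{M\}$ and $\mathcal{A} \leftarrow \mathcal{A} + \{M\}$

  3. Select a submodel $M_0$ of $M$ by removing a variable from $M$

  4. Compute $B = \log\left(\dfrac{\mathrm{pr}(M_0 \mid D)}{\mathrm{pr}(M \mid D)}\right)$

  5. If $B > O_R$ then $\mathcal{A} \leftarrow \mathcal{A} - \{M\}$ and if $M_0 \notin \mathcal{C}$, $\mathcal{C} \leftarrow \mathcal{C} + \{M_0\}$

  6. If $O_L \leq B \leq O_R$ then if $M_0 \notin \mathcal{C}$, $\mathcal{C} \leftarrow \mathcal{C} + \{M_0\}$

  7. If there are more submodels of $M$, go to 3

  8. If $\mathcal{C} \neq \emptyset$, go to 1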


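• Up algorithm

  1. Select a model $M$ from $\mathcal{C}$

  2. $\mathcal{C} \leftarrow \mathcal{C} - \{M\}$ and $\mathcal{A} \leftarrow \mathcal{A} + \{M\}$

  3. Select a supermodel $M_1$ of $M$ by adding a variable to $M$

  4. Compute $B = \log\left(\dfrac{\mathrm{pr}(M \mid D)}{\mathrm{pr}(M_1 \mid D)}\right)$

  5. If $B < O_L$ then $\mathcal{A} \leftarrow \mathcal{A} - \{M\}$ and if $M_1 \notin \mathcal{C}$, $\mathcal{C} \leftarrow \mathcal{C} + \{M_1\}$

  6. If $O_L \leq B \leq O_R$ then if $M_1 \notin \mathcal{C}$, $\mathcal{C} \leftarrow \mathcal{C} + \{M_1\}$

  7. If there are more supermodels of $M$, go to 3

  8. If $\mathcal{C} \neq \emptyset$, go to 1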

Upon termination, $\mathcal{A}$ contains the set of potentially acceptable models. Finally, we
remove all the models which satisfy (5.7), where 1 is replaced by $\exp(O_R)$, and those
models $M_k$ for which

$$\frac{\max_l \mathrm{pr}(M_l \mid D)}{\mathrm{pr}(M_k \mid D)} > C.$$

$\mathcal{A}$ now contains the acceptable models.
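The search described above can be sketched in Python as follows. This is an illustrative reading of the ‘Down’ and ‘Up’ passes, not the original implementation. Models are represented as frozensets of variable indices, `log_post` is a user-supplied function returning the log posterior model probability up to an additive constant, and the default thresholds ($O_L = -\log 20$, $O_R = 0$, $C = 20$) are assumed values chosen for the example. As a bookkeeping assumption, a model that has already been queued is not queued again.

```python
import math

def down_pass(start, log_post, o_left, o_right):
    """'Down' algorithm: compare each model with its one-variable-smaller submodels."""
    accepted, pending = set(), list(start)
    seen = set(pending)
    while pending:
        model = pending.pop()
        accepted.add(model)
        for var in model:
            sub = model - {var}
            b = log_post(sub) - log_post(model)   # B = log pr(M0|D) / pr(M|D)
            if b > o_right:                       # step 5: reject M
                accepted.discard(model)
            if b >= o_left and sub not in seen:   # steps 5-6: consider M0
                seen.add(sub)
                pending.append(sub)
    return accepted

def up_pass(start, log_post, variables, o_left, o_right):
    """'Up' algorithm: compare each model with its one-variable-larger supermodels."""
    accepted, pending = set(), list(start)
    seen = set(pending)
    while pending:
        model = pending.pop()
        accepted.add(model)
        for var in variables - model:
            sup = model | {var}
            b = log_post(model) - log_post(sup)   # B = log pr(M|D) / pr(M1|D)
            if b < o_left:                        # step 5: reject M
                accepted.discard(model)
            if b <= o_right and sup not in seen:  # steps 5-6: consider M1
                seen.add(sup)
                pending.append(sup)
    return accepted

def occams_window(start, log_post, variables,
                  o_left=-math.log(20.0), o_right=0.0, c=20.0):
    """Run 'Down' then 'Up', then apply the two final exclusion rules."""
    variables = frozenset(variables)
    down = down_pass(start, log_post, o_left, o_right)
    accepted = up_pass(down, log_post, variables, o_left, o_right)
    best = max(log_post(m) for m in accepted)
    keep = set()
    for m in accepted:
        if best - log_post(m) > math.log(c):      # far worse than the best model
            continue
        if any(s < m and log_post(s) - log_post(m) > o_right for s in accepted):
            continue                              # a simpler accepted submodel is preferred
        keep.add(m)
    return keep
```

In a toy problem where only variable 0 carries signal, for example `log_post = lambda m: (5.0 if 0 in m else 0.0) - 1.5 * len(m)`, the search started from either the full model or the null model ends with the single model containing variable 0.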


5.4  Markov  chain  Monte  Carlo  model  composition 

Our second approach is to approximate (5.1) using the Markov chain Monte Carlo
model composition approach [76]. This generates a stochastic process which moves
through model space. Specifically, let $\mathcal{M}$ denote the space of models under consideration.
We can construct a Markov chain $\{M(t), t = 1, 2, \ldots\}$ with state space $\mathcal{M}$
and equilibrium distribution $\mathrm{pr}(M_i \mid D)$. If we simulate this Markov chain for
$t = 1, \ldots, N$, then under certain regularity conditions, for any function $g(M_i)$ defined on
$\mathcal{M}$, the average

$$\hat{G} = \frac{1}{N} \sum_{t=1}^{N} g(M(t)) \qquad (5.10)$$

is a simulation-consistent estimate of $E(g(M))$ [105]. To compute (5.1) in this fashion,
set $g(M) = \mathrm{pr}(\Delta \mid M, D)$, where $\Delta$ is the quantity of interest.

To construct the Markov chain we define a neighborhood $\mathrm{nbd}(M)$ for each $M \in \mathcal{M}$
which consists of the model $M$ itself and the set of models with either one variable
more or one variable fewer than $M$. Define a transition matrix $q$ by setting
$q(M \to M') = 0$ for all $M' \notin \mathrm{nbd}(M)$ and $q(M \to M')$ constant for all $M' \in \mathrm{nbd}(M)$. If
the chain is currently in state $M$, we proceed by drawing $M'$ from $q(M \to M')$. It is
then accepted with probability

$$\min\left\{1, \frac{\mathrm{pr}(M' \mid D)}{\mathrm{pr}(M \mid D)}\right\}.$$

Otherwise the chain stays in state $M$.
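The sampler just described can be sketched in a few lines of Python. This is an illustration under stated assumptions: models are frozensets of variable indices, `log_post` is a hypothetical user-supplied function returning $\log \mathrm{pr}(M \mid D)$ up to an additive constant, and the proposal is uniform over the neighborhood, which has the same size for every model and is therefore symmetric.

```python
import math
import random

def mc3(log_post, variables, start, n_steps, seed=0):
    """Metropolis sampler over subsets of `variables`; returns visit frequencies."""
    rng = random.Random(seed)
    variables = sorted(variables)
    current = frozenset(start)
    visits = {}
    for _ in range(n_steps):
        # Uniform proposal over nbd(current): the model itself plus every model
        # obtained by adding or dropping exactly one variable.
        pick = rng.randrange(len(variables) + 1)
        if pick == len(variables):
            proposal = current
        else:
            proposal = current ^ {variables[pick]}   # toggle one variable
        log_ratio = log_post(proposal) - log_post(current)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        visits[current] = visits.get(current, 0) + 1
    return {m: count / n_steps for m, count in visits.items()}

def estimate(freqs, g):
    """The average (5.10), written as a sum over the distinct visited models."""
    return sum(f * g(m) for m, f in freqs.items())
```

The visit frequencies approximate the posterior model probabilities, so `estimate(freqs, g)` with `g(M)` set to the posterior mean of the quantity of interest under `M` gives a model-averaged estimate in the sense of (5.1).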


5.5  Freedman’s  paradox  resolved 

Linear regression models are frequently used even when little is known about the
relationship between the predictors and the response. In [35] it was shown that in the
extreme case where there is no relationship between the predictors and the response
variable, omitting the predictors with the smallest $t$-values (e.g., $p > 0.25$) can result in
a model with a highly significant $F$ statistic and high $R^2$. We will refer to this unfortunate
phenomenon as ‘Freedman’s paradox’. In contrast, if the response and predictors
are independent, Occam’s window typically indicates the null model only, or as one of
a small number of ‘best’ models.

TNO report TNO-DV1 2004 A234

As  in  [35],  we  generated  5100  independent  observations  from  a  standard  normal  dis¬ 
tribution  to  create  a  matrix  with  100  rows  and  51  columns.  The  first  column  was  taken 
to  be  the  dependent  variable  in  a  regression  equation  and  the  other  50  columns  were 
taken  to  be  the  predictors.  Thus  the  predictors  are  independent  of  the  response  by  con¬ 
struction.  For  the  entire  data  set,  the  multiple  regression  results  were  as  follows 

• $R^2 = 0.55$, $p = 0.29$,

•  18  coefficients  out  of  50  were  significant  at  the  0.25  level,  and 

•  4  coefficients  out  of  50  were  significant  at  the  0.05  level. 

Three  different  variable  selection  procedures  were  used  on  the  simulated  data.  The 
first of these was the method used in [35]. Here all predictors with $p$-values of 0.25 or
lower  were  included  in  a  second  pass  over  the  data.  The  results  of  this  method  were  as 
follows 

• $R^2 = 0.40$, $p = 0.0003$,

•  17  coefficients  out  of  18  were  significant  at  the  0.25  level,  and 

•  10  coefficients  out  of  18  were  significant  at  the  0.05  level. 

These  results  are  highly  misleading  as  they  indicate  a  definite  relationship  between  the 
response  and  the  predictors,  whereas,  in  fact,  the  data  are  all  noise. 

The  second  model  selection  method  used  on  the  full  data  set  was  Efroymson’s  stepwise 
method  [78].  This  method  indicated  a  model  with  15  predictors  with  the  following 
results 

• $R^2 = 0.40$, $p = 0.0001$,

•  all  15  were  significant  at  the  0.25  level,  and 

•  10  coefficients  out  of  15  were  significant  at  the  0.05  level. 

Again  a  model  is  chosen  which  misleadingly  appears  to  have  a  great  deal  of  explana¬ 
tory  power. 

The  third  variable  selection  method  was  Occam’s  window.  The  only  model  chosen  by 
this  method  was  the  null  model.  The  procedure  described  above  was  repeated  10  times 
with  similar  results.  In  5  simulations,  Occam’s  window  chose  only  the  null  model.  For 
the  remaining  simulations  3  models  or  fewer  were  chosen  along  with  the  null  model. 
For the non-null models that were chosen, all models had $R^2$ values less than 0.15. For
all of the simulations the selection procedure used in [35] and Efroymson’s stepwise
method chose models with many predictors and highly significant $R^2$ values.


Table 5.1: Log predictive scores for the Freedman simulated noise data.

Model            Method             Log predictive score
null model       Occam’s window     133
18 predictors    Freedman [35]      174
15 predictors    Efroymson          181

To compare the predictive performance of the models chosen by the three methods,
another data set with 100 rows and 51 columns was simulated and log predictive scores
were calculated (see table 5.1).

The measure of predictive ability is the logarithmic scoring rule described in [42], which
is based on the conditional predictive ordinate [39]. Specifically, writing $D^T$ for the part
of the data used to fit the models, we measured the predictive ability of an individual
model, $M$, with

$$-\sum_{d \in D \setminus D^T} \log \mathrm{pr}(d \mid M, D^T).$$

We measured the predictive performance for model averaging with

$$-\sum_{d \in D \setminus D^T} \log \left( \sum_{M \in \mathcal{A}} \mathrm{pr}(d \mid M, D^T)\,\mathrm{pr}(M \mid D^T) \right),$$

where for Occam’s window $\mathcal{A}$ is the set of selected models and for Markov chain
Monte Carlo model composition $\mathcal{A}$ is the set of visited models. Smaller scores indicate
better predictive performance. The log predictive score for the only model selected by
Occam’s window (the null model) is considerably better than the log predictive scores for
the models chosen by the other two methods. In addition, the mean square predictive
error was calculated. The mean square predictive error for Freedman’s method was 1.4
and the mean square predictive error for the Efroymson model was 1.5, while the mean
square predictive error for the null model was 0.9. Thus Occam’s window has considerably
greater out-of-sample predictive power than the more standard variable selection methods
considered.
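The two scores can be computed as in the following sketch. It is illustrative only: the function names are hypothetical, each model is represented by a function returning its log predictive density for a held-out point, and the normal densities and data points in the example are assumed values, not results from the simulation reported here.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of a normal distribution, used here as a stand-in predictive."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_score_single(test_points, log_pred):
    """-sum_d log pr(d | M, D^T) for one model."""
    return -sum(log_pred(d) for d in test_points)

def log_score_averaged(test_points, log_preds, post_probs):
    """-sum_d log sum_M pr(d | M, D^T) pr(M | D^T), computed via log-sum-exp."""
    total = 0.0
    for d in test_points:
        terms = [lp(d) + math.log(w)
                 for lp, w in zip(log_preds, post_probs) if w > 0.0]
        peak = max(terms)
        total -= peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total

# Hypothetical held-out points and two assumed predictive densities.
points = [0.1, -0.4, 1.3]
narrow = lambda d: normal_logpdf(d, 0.0, 1.0)
wide = lambda d: normal_logpdf(d, 0.5, 4.0)
score_narrow = log_score_single(points, narrow)
score_wide = log_score_single(points, wide)
score_mixed = log_score_averaged(points, [narrow, wide], [0.5, 0.5])
```

By Jensen's inequality the averaged score can never exceed the correspondingly weighted average of the single-model scores, which is one way of seeing why model averaging tends to do well under this scoring rule.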

At  best,  Occam’s  window  correctly  indicates  that  the  null  model  is  the  only  model 
that should be chosen when there is no signal in the data. At worst, Occam’s window
chooses  the  null  model  along  with  several  other  models.  The  presence  of  the  null  model 
among  those  chosen  by  Occam’s  window  should  indicate  to  a  researcher  that  there  may 
be  evidence  for  a  lack  of  signal  in  the  analyzed  data.  Thus  Occam’s  window  largely 
resolves  ‘Freedman’s  paradox’. 


5.6 Discussion

In  [28]  the  problem  of  assessing  model  uncertainty  has  also  been  addressed.  This  ap¬ 
proach  was  based  on  the  idea  of  model  expansion,  i.e.,  starting  with  a  single  reasonable 
model  chosen  by  a  data-analytic  search,  expanding  model  space  to  include  those  mod¬ 
els  which  are  suggested  by  context  or  other  considerations,  and  then  averaging  over 
this  model  class.  In  [28]  the  problem  of  model  uncertainty  in  variable  selection  is  not 
addressed  directly.  However,  one  could  consider  Occam’s  window  to  be  a  practical 
implementation  of  model  expansion. 

In [41] a stochastic search variable selection method, similar in spirit to Markov chain
Monte Carlo model composition, has been developed. Here a Markov chain is defined
which moves through model space and parameter space at the same time. To make the
chain irreducible, however, the method of [41] never actually removes a predictor from the
full model, but only sets it close to zero with high probability. If this probability is
very high, the algorithm has convergence difficulties, and if not the results can be hard
to interpret. Our new approach avoids this problem by integrating analytically over
parameter space.

The prior covariance matrix for $\beta$ depends on the actual data, including
both the dependent and independent variables. A similar data-dependent approach
to the assessment of the priors was used in [84]. While this may appear at first
sight to be contrary to the idea of a prior, our objective was to develop priors that lead to
posteriors similar to those of a person with little prior information. Examples analyzed
to date suggest that this objective was achieved. The priors for $\beta$ lead to a reasonable
prior variance and result in conclusions that are not highly sensitive to the choice of
hyperparameters. Thus the data dependence does not appear to be a drawback.

The  choice  of  which  procedure  to  use  -  Occam’s  window  or  Markov  chain  Monte 
Carlo  model  composition  -  will  depend  on  the  particular  application.  Occam’s  window 
will  be  most  useful  when  one  is  interested  in  making  inferences  about  the  relationship 
between  variables.  Occam’s  window  also  tends  to  be  much  faster  computationally. 
Markov  chain  Monte  Carlo  model  composition  is  the  better  procedure  to  choose  when 
the  goal  is  good  predictions  or  if  the  posterior  distribution  of  some  quantity  is  of  more 
interest  than  the  nature  of  the  ‘true’  model  and  if  computer  time  is  not  a  critical  con¬ 
sideration.  However,  each  approach  is  flexible  enough  to  be  used  successfully  for  both 
inference and prediction.


6. Making predictions reliable


According  to  the  MDL  principle,  models  of  data  are  always  probabilistic;  if  a  class  of 
non-probabilistic  models  is  used  to  model  the  data  at  hand,  it  is  first  mapped  to  a  cor¬ 
responding  probabilistic  class.  On  the  other  hand,  these  models  are  to  be  interpreted  as 
codes  for  the  data  -  not  as  traditional  probability  distributions  according  to  which  the 
data  are  generated.  This  raises  the  question  of  what  conclusions  (predictions)  about  fu¬ 
ture  data  can  and  what  conclusions  cannot  be  drawn  on  the  basis  of  such  ‘probabilistic’ 
models.  The  question  becomes  all  the  more  difficult  if  we  acknowledge,  in  line  with 
the  MDL  philosophy,  that  our  models  will  always  be  partially  wrong  -  even  if  they 
allow  us  to  substantially  compress  the  data. 

In this chapter, we identify conditions under which a probabilistic model, inferred from
the data, can be used to reliably predict future data even if that model is really a probabilistic
representation of a non-probabilistic model and/or if the model is wrong. We
show that given a model class $\mathcal{M}$ with a fixed number of parameters ($\mathcal{M}$ is not necessarily
probabilistic) and an error function ER, we can turn $\mathcal{M}$ into a probabilistic
version $\langle\mathcal{M}\rangle_{\mathrm{ER}}$ that is essentially equivalent to $\mathcal{M}$ except that it leads to ‘reliable’ estimates
of the error function. We call $\langle\mathcal{M}\rangle_{\mathrm{ER}}$ the entropification of $\mathcal{M}$. Entropification
is the basis of the main results of this chapter (theorems 6.1 - 6.3), which can
be summarized as follows:

1. Under the assumption that the data are i.i.d. according to an essentially arbitrary
   unknown ‘true’ probability distribution $P^*$, we can infer from a large
   enough data set $D$, with high probability, a model $\theta$ in $\langle\mathcal{M}\rangle_{\mathrm{ER}}$ that is

   (a) the optimal model in $\langle\mathcal{M}\rangle_{\mathrm{ER}}$ for predicting future data against error function
       ER: among the models in $\langle\mathcal{M}\rangle_{\mathrm{ER}}$, $\theta$ minimizes the ‘true’ expected
       error $E_{P^*}(\mathrm{ER}(Y \mid \theta, X))$, and

   (b) can be ‘reliably’ used, since it gives a truthful impression of its own
       performance in the sense that $E_{\theta}(\mathrm{ER}(Y \mid \theta, X)) = E_{P^*}(\mathrm{ER}(Y \mid \theta, X))$.

Essentially, this means that whenever the assumption that the data are i.i.d. can be
justified and the function ER according to which errors will be measured is known,
the model $\theta$ can be used (1) to arrive at optimal predictions (relative to the model
class $\langle\mathcal{M}\rangle_{\mathrm{ER}}$) of future data against ER and (2) as an accurate estimator of how good
predictions will be - even if $\mathcal{M}$ is a wrong (‘misspecified’) model class that does not
contain any model that is similar to the ‘true’ $P^*$.

While  9  can  be  inferred  from  data  by  many  statistical  inference  procedures  (not  nec¬ 
essarily  MDL),  the  ‘entropification’  of  M  turns  out  to  yield  additional  results  when 
combined  with  MDL,  leading  to  the  other  two  important  results  of  this  chapter 

2.  Entropification  removes  an  inherent  arbitrariness  in  MDL’s  trade-off  between 
error  and  model  complexity  that  occurs  when  non-probabilistic  model  classes 
are  used  (section  6.2),  and 

3.  Entropification  allows  us  to  associate  codes  with  non-probabilistic  model  classes 
in  an  optimal  manner,  in  the  sense  that  the  ‘worst  case  expected  code  length’ 
is minimized (proposition 6.8).

It is well known that modeling the error using the normal distribution with varying $\sigma^2$
works even when the errors are not truly normally distributed [13], so thus far there is
nothing new here. Our own contribution lies in the fact that we consider the general
case of (almost) arbitrary error functions ER and model classes $\mathcal{M}$. We will give a
recipe of how, given $\mathcal{M}$ and ER, one can define a new probabilistic class $\langle\mathcal{M}\rangle_{\mathrm{ER}}$ that
has some special properties. We call the model class $\langle\mathcal{M}\rangle_{\mathrm{ER}}$ the entropification of $\mathcal{M}$
with respect to ER. $\langle\mathcal{M}\rangle_{\mathrm{ER}}$ is constructed from $\mathcal{M}$ by adding a single extra real-valued
parameter $\beta$ as part of the hypotheses: models in $\langle\mathcal{M}\rangle_{\mathrm{ER}}$ are indexed by parameters
$\theta = (H, \beta)$ for some $H \in \mathcal{M}$ and $\beta \in \mathbb{R}$. If $(H, \beta)$ is inferred from data $D$, then
the $\beta$ associated with $H$ can be interpreted as a reliable estimate of the error $H$ will
make on future data. $\beta$ will also determine the entropy of the model $(H, \beta)$, hence the
name ‘entropification’. We give three examples corresponding to three often used error
functions.

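• If $\mathcal{M}$ is a class of continuous functions and ER is the squared error, then $\langle\mathcal{M}\rangle_{\mathrm{ER}}$
  turns out to be equivalent to the class $\{P(\cdot \mid H, \sigma^2, \cdot) \mid H \in \mathcal{M};\ \sigma^2 > 0\}$ where

  $$P(y \mid H, \sigma^2, x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - H(x))^2}{2\sigma^2}\right). \qquad (6.1)$$

  If $(H, \sigma^2)$ is inferred from $D$, then $\sigma^2$ can be interpreted as an estimate of the
  squared error $H$ will make on future data.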

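• Let $E$ be a sample space. If $\mathcal{M}$ is a class of concepts (functions mapping $E_x$
  on $E_y = \{0, 1\}$) and ER is the 0/1-error, defined by

  $$\mathrm{ER}_{01}(y \mid H, x) = \begin{cases} 0 & \text{if } H(x) = y \\ 1 & \text{otherwise,} \end{cases} \qquad (6.2)$$

  then $\langle\mathcal{M}\rangle_{\mathrm{ER}}$ is equivalent to a class of distributions
  $\{P(\cdot \mid H, \theta, \cdot) \mid H \in \mathcal{M};\ 0 < \theta < 1\}$ where

  $$E_{P(\cdot \mid H, \theta, \cdot)}(\mathrm{ER}(Y \mid H, X)) = \theta.$$

  $\theta$ can be interpreted as the probability that $H(X) \neq Y$. If $(H, \theta)$ is inferred
  from $D$, then $\theta$ can be interpreted as an estimate of the 0/1-error that $H$ will
  make on future data.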

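• Let $\mathcal{M}$ be a model class that is finitely parameterized by some $T$. If $\mathcal{M}$ is a
  class of probabilistic models $\{P(\cdot \mid \eta) \mid \eta \in T\}$ and ER is the logarithmic error,
  then $\langle\mathcal{M}\rangle_{\mathrm{ER}}$ will turn out to be equivalent to a class $\{P(\cdot \mid \eta, \beta) \mid \eta \in T;\ \beta \in \mathbb{R}\}$,
  where

  $$P(y \mid \eta, \beta) = \frac{P(y \mid \eta)^{\beta}}{\sum_{y'} P(y' \mid \eta)^{\beta}}.$$

  If $(\eta, \beta)$ is inferred from $D$, then $\beta$ can be interpreted as an estimate of the
  logarithmic error that $\eta$ will make on future data.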

This  chapter  is  organized  as  follows.  In  section  6.1  we  formally  introduce  the  concept 
of  ‘entropification’,  present  some  of  its  basic  properties  and  give  some  examples  of  its 
use  and  present  our  main  results  (item  1  in  the  listing  above).  In  section  6.2  we  show 
how entropification can be used in the context of MDL.


6.1 Entropification of a model class

In this section we formally introduce the concept of ‘entropification’, give some
examples of its use and present our main results. We will assume that we use some
reasonable inference procedure: one that, for those sequences $x_1, x_2, \ldots$ for which there
really is ‘something to infer’, is guaranteed to work for large enough samples.

Definition 6.1

Let $\mathcal{M}$ be a model class that is finitely parameterized by some $T$. Let $C_{\mathcal{M}}$ be an estimation
procedure that, for each $n$ and $x^n \in E^n$, outputs an estimator $\tilde\theta(x^n) \in T$. We call
$C_{\mathcal{M}}$ reasonable if for every sequence $x_1, x_2, \ldots$ for which the maximum likelihood
estimate $\hat\theta(x^n)$ converges to some value, $\tilde\theta(x^n)$ converges to that same value; that is,

if $\lim_{n\to\infty} \hat\theta(x^n)$ exists and is equal to $\theta_{\mathrm{fut}} \in T$,
then $\lim_{n\to\infty} \tilde\theta(x^n)$ must also exist and be equal to $\theta_{\mathrm{fut}}$.


Apparently,  some  aspects  of  the  data  can  be  reliably  predicted  on  the  basis  of  the  max¬ 
imum  entropy  model  while  others  cannot.  We  will  now  define  the  notions  of  reliable 
estimations  and  decisions.  The  following  definition  captures  the  idea  of  reliable  esti¬ 
mation.  In  the  definition,  int(U)  stands  for  the  interior  of  the  set  U. 

Definition 6.2

Let $\mathcal{M}$ be a class of probabilistic models over $E$ parameterized by some $T$. Let
$\psi : E \to U$ be a given function. If, for all $n$ and all $x^n \in E^n$ with $\hat\theta(x^n) \in \mathrm{int}(T)$, we have

$$E_{\hat\theta(x^n)}(\psi(X)) = \frac{1}{n} \sum_{i=1}^{n} \psi(x_i), \qquad (6.3)$$

and, moreover, $E_{\theta}(\psi(X))$ is a continuous function of $\theta$, then we say that averages of
$\psi$ can be reliably estimated on the basis of $\mathcal{M}$. Otherwise we say that averages of $\psi$
cannot be reliably estimated on the basis of $\mathcal{M}$.

We only consider the case where $\mathcal{M}$ can be parameterized by a fixed number of
parameters $k$. This is formalized in the following definition:

Definition 6.3

Let $\mathcal{M}$ be a class of probabilistic models over sample space $E$ and let $T \subset \mathbb{R}^k$. We say
that $\mathcal{M}$ is finitely parameterized by $T$ if

1. there exists a bijection $g : T \to \mathcal{M}$,

2. if $D \in E^*$ is arbitrary but fixed, then $P(D \mid \theta)$ as a function of $\theta$ is the restriction
   to domain $T$ of a continuous function $\psi : \mathbb{R}^k \to \mathbb{R}$,

3. if $E$ is continuous, then for all $n$, the density function $f(x^n \mid \theta)$ as a function of
   $x^n$ is continuous at each $x^n$ in the interior of $E^n$.

Throughout the remainder of this chapter we assume that all error functions ER considered
are sufficiently regular. More precisely, let $E = E_x \times E_y$, and let $\mathrm{ER} : E \times \mathcal{M} \to \mathbb{R}$
be an error function defined as follows.

Definition 6.4

Let $\mathcal{M}$ be a set and $E$ be a sample space. An error function for $\mathcal{M}$ is a total function
$\mathrm{ER} : E \times \mathcal{M} \to U$ where $U \subset \mathbb{R}$. For $x \in E$ and $H \in \mathcal{M}$, we write $\mathrm{ER}(x \mid H)$
rather than $\mathrm{ER}(x, H)$. We restrict ourselves to additive error functions: ER is extended
to outcomes $x^n \in E^n$ by $\mathrm{ER}(x^n \mid H) = \sum_{i=1}^{n} \mathrm{ER}(x_i \mid H)$.

Our assumption is that, for all fixed $H \in \mathcal{M}$, $\mathrm{ER}(y|H, x)$ considered as a function of $x$
and $y$ can be used to define a maximum entropy model class. Let $\phi = (\phi_1, \dots, \phi_m)$ be
a function with domain $E$ and range $U = U_1 \times \dots \times U_m$, and let $t = (t_1, \dots, t_m) \in U$.
We require the constraint $E(\phi(X)) = t$ to be such that, for $i = 1, \dots, m$, the conditions below hold. Specifically,
we assume that for $H \in \mathcal{M}$, the function $\phi_H$ over $E_x \times E_y$ defined by $\phi_H(x, y) = \mathrm{ER}(y|H, x)$
satisfies the following conditions:

C1 $U_i$ is the smallest interval in $\mathbb{R}$ such that $\forall x \in E$: $\phi_i(x) \in U_i$,

C2 If $E$ is continuous, then $\phi_i$ is continuous. More precisely, if $E \subset \mathbb{R}^k$, then $\phi_i$
is the restriction to domain $E$ of some continuous function from $\mathbb{R}^k$ to $\mathbb{R}$,

C3 In the discrete case $E$ contains a finite number of elements. In the continuous
case $E$ can be written as $E_1 \times \dots \times E_l$ for some $l \geq 1$, where for each $E_j$ with
$1 \leq j \leq l$

1. $E_j$ is a closed interval in $\mathbb{R}$, or

2. $E_j = \mathbb{R}$ and there exist $a > 0$ and $C \in \mathbb{R}$ such that

$\forall x_1, \dots, x_l : \phi_i(x_1, \dots, x_j, \dots, x_l) \geq |x_j|^a - C$.

This ensures that the maximum entropy model class for $\phi_H(x, y)$ exists. But we need
something  stronger  than  merely  the  guaranteed  existence  of  this  class,  as  we  will  now 
explain. 

Throughout this chapter, we consider two cases. In the first case, the hypothesis class
$\mathcal{M}$ contains models relating to $E_y$ and not $E_x$ (for example, each $H \in \mathcal{M}$ is itself a
probabilistic model over $E_y$ or each $H$ is a relation over $E_y$ not involving $E_x$). In such
a case $E_x$ does not really play any role and we could have equally well set $E = E_y$.
In this situation, the entropification of a model class $\mathcal{M}$ with respect to error function
$\mathrm{ER} : E \times \mathcal{M} \to \mathbb{R}$ is the class of probabilistic models containing, for each $H \in \mathcal{M}$,
the class of distributions $P(\cdot|H, \beta)$ defined by

$P(y|H, \beta) = \frac{1}{Z_H(\beta)} \exp(-\beta\, \mathrm{ER}(y|H))$,  (6.4)

where $Z_H(\beta)$ is a normalizing factor

$Z_H(\beta) = \sum_{y \in E_y} \exp(-\beta\, \mathrm{ER}(y|H))$,  (6.5)

and $\beta$ ranges over all $\beta \in \mathbb{R}$ for which $P(y|H, \beta)$ is well-defined. As can be seen, the
distributions (6.4) are formally equivalent to maximum entropy distributions.

The formal definition of entropification, which we give below, unifies the unconditional case
with the more complicated conditional or supervised case. In the latter case, we assume
that the outcomes $x \in E_x$ do play a role, and we are interested in the conditional version
of (6.4), $P(y|H, \beta, x) = Z_{H,x}(\beta)^{-1} \exp(-\beta\, \mathrm{ER}(y|H, x))$ with $Z_{H,x}(\beta) = \sum_{y \in E_y} \exp(-\beta\, \mathrm{ER}(y|H, x))$.
However, all our results will only hold if the resulting distributions
are still ‘essentially’ maximum entropy distributions. For this reason, we must
additionally assume that the following conditions hold

C4  Ex  is  either  finite  or  compact, 

C5 ER is such that for each fixed $H$ and each fixed $\beta \in \mathbb{R}$, $Z_{H,x}(\beta)$ is either equal
for all $x \in E_x$ or it diverges for all $x \in E_x$.

C5 turns out to hold for most error functions ER of interest. These include error functions
like the squared and 0/1-error. C5 allows us to drop the subscript $x$ in $Z_{H,x}(\beta)$
and write, for arbitrary $x \in E_x$

$Z_H(\beta) = \sum_{y \in E_y} \exp(-\beta\, \mathrm{ER}(y|H, x))$.  (6.6)

In the remainder of this chapter, we tacitly assume ER to satisfy conditions C1-C5,
omitting the divergent case. We are now ready to define entropification formally.


Definition  6.5  (Entropification) 

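Let $E = E_x \times E_y$. The entropification of a model class $\mathcal{M}$ with respect to error function
$\mathrm{ER} : E \times \mathcal{M} \to \mathbb{R}$ is the class of (conditional) probabilistic models

$\langle \mathcal{M} \rangle_{\mathrm{ER}} = \{ P(\cdot|\theta, \cdot) \mid \theta = (H, \beta);\ H \in \mathcal{M};\ \beta \in \Gamma_{nat}(H) \}$.  (6.7)

Here $P(\cdot|\theta, \cdot) = P(\cdot|H, \beta, \cdot)$ is a conditional model defined as follows:

1. For each $x \in E_x$, $P(\cdot|\theta, x) = P(\cdot|H, \beta, x)$ is a probability distribution over
$E_y$ defined by

$P(y|H, \beta, x) = \frac{1}{Z_H(\beta)} \exp(-\beta\, \mathrm{ER}(y|H, x))$ for all $y \in E_y$.  (6.8)

Here $Z_H(\beta)$ is as in (6.6).

2. For all $(x^n, y^n) \in E^n$, $P(y^n|H, \beta, x^n) = \prod_{i=1}^{n} P(y_i|H, \beta, x_i)$.

For each $H$, the set of $\beta$ such that $P(\cdot|H, \beta, \cdot) \in \langle \mathcal{M} \rangle_{\mathrm{ER}}$ is given by

$\Gamma_{nat}(H) = \{ \beta \mid Z_H(\beta) < \infty \}$,

where $\Gamma_{nat}$ is the natural parameter space.

If $E$ is continuous, the sum in $Z_H(\beta)$ gets replaced by the corresponding integral.
$Z_H(\beta)$ acts as a normalizing constant.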

The code corresponding to $P(\cdot|H, \beta, \cdot)$ leads to the following code lengths (expressed
in nats [20])

$L(y^n|H, \beta, x^n) = -\ln(P(y^n|H, \beta, x^n)) = \beta \sum_{i=1}^{n} \mathrm{ER}(y_i|H, x_i) + n \ln(Z_H(\beta))$.  (6.9)

We see that the code length of $y^n$ given $H$, $\beta$, $x^n$ contains an error term and a ‘uniform’
term $n \ln(Z_H(\beta))$ that grows linearly in $n$ and is equal for all $y^n$. This shows that $\beta$ can
be interpreted as determining how strongly the error should be weighted in the code
length corresponding to hypotheses $(H, \beta)$. The extreme case $\beta = 0$ corresponds, for
each $x_i$, to the uniform distribution over all outcomes in $E_y$. For fixed $H$, in the limit for
$\beta \to \infty$, the probability under $(H, \beta)$ of an outcome $y^n$ given $x^n$ with $\mathrm{ER}(y^n|H, x^n) > 0$
becomes 0.
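The decomposition (6.9) can be illustrated numerically. The following is a minimal sketch for the unconditional case with a finite outcome set; the error values, the sample and the function names are hypothetical and chosen only for illustration.

```python
import math

def entropify(errors, beta):
    """Return P(y | H, beta) over a finite outcome set, given ER(y | H) for each outcome."""
    weights = [math.exp(-beta * e) for e in errors]
    z = sum(weights)                      # normalizing factor Z_H(beta), as in (6.5)
    return [w / z for w in weights], z

def code_length(sample_errors, errors, beta):
    """Code length in nats via (6.9): beta * (sum of errors) + n * ln Z_H(beta)."""
    _, z = entropify(errors, beta)
    return beta * sum(sample_errors) + len(sample_errors) * math.log(z)

errors = [0.0, 1.0, 4.0]                  # hypothetical ER(y | H) for three outcomes
probs_uniform, _ = entropify(errors, 0.0)  # beta = 0: uniform distribution
probs_sharp, _ = entropify(errors, 50.0)   # large beta: mass concentrates on zero error

sample = [0.0, 1.0, 0.0, 4.0]             # errors made on a hypothetical sample
beta = 0.7
probs, _ = entropify(errors, beta)
direct = -sum(math.log(probs[errors.index(e)]) for e in sample)
decomposed = code_length(sample, errors, beta)
```

Here `direct` sums $-\ln P$ outcome by outcome, while `decomposed` uses the error term plus the ‘uniform’ term; the two agree up to rounding.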

Example  6.1 

Consider the fitting of polynomials. Let data $D = (x^n, y^n)$ be given. In a non-probabilistic
approach to this problem, we would use some algorithm that, for each
input $D$, outputs a polynomial $H$ that it regards as an optimal hypothesis for $D$. Such a
polynomial $H$ in itself does not give any information on how good it will be on future
data, and this can be problematic. For example, imagine that a company uses some
sophisticated tool to infer $H$ from lots of data, and then sells $H$ to a client so that the
client can use it to predict future data ($H$ may, for example, be a model for some data
from the stock exchange and the client may use it as a guideline for future investments).
If the company only gives $H$ to the client, then the client has no means of knowing how
well $H$ actually will predict future data. This can be most easily demonstrated if we
imagine that the model class $\mathcal{M}$ is restricted to the class of first-degree polynomials.
Let us denote by $D_1$ a first data set, by $D_2$ a second data set and by $H$ an optimal first-degree
polynomial. Assuming that the company uses a reasonable (definition 6.1) method to
infer the best polynomial, it will infer a polynomial reasonably close to $H$ for both
data sets. However, if future data behaves like present data, then in the case of $D_1$, $H$
will be a much better predictor than in the case of $D_2$. The client (who has not seen
the ‘training’ data) would probably like to know how good the inferred hypothesis $\hat{H} \approx H$
is before deciding whether to buy it or not; but $H$ does not reveal this information.
Therefore, the client may rather want the company to sell a tuple $(H, \sigma^2)$ where $\sigma^2$ is
some reasonable estimate of the error $H$ will make on future data. In this way, he will
get a reliable impression of the performance of $H$. Now, let $\mathcal{M}$ be a class of continuous
functions $H : E_x \to E_y$. From the definition of entropification (definition 6.5) we can
see (by substituting $\beta = 1/(2\sigma^2)$) that $\langle \mathcal{M} \rangle_{\mathrm{ER}_{sq}}$, the entropification of $\mathcal{M}$ with respect
to the squared error, is equivalent to the model class that supplies $\mathcal{M}$ with a normal
error distribution of arbitrary variance $\sigma^2 > 0$. Formally

$\langle \mathcal{M} \rangle_{\mathrm{ER}_{sq}} = \{ P(\cdot|H, \sigma^2, \cdot) \mid H \in \mathcal{M};\ \sigma^2 > 0 \}$,  (6.10)

with $P(\cdot|H, \sigma^2, \cdot)$ as given by (6.1). In this case, the value of $Z_H(\beta)$ is independent
of $H$ and finite for all $\beta > 0$. The parameter space will be $\Gamma_{nat}(H) = \{\beta \mid \beta > 0\}$
independently of $H$, corresponding to all variances $\sigma^2 = 1/(2\beta) > 0$. Note that
$E_{P(\cdot|H, \sigma^2, \cdot)}(\mathrm{ER}_{sq}(Y|H, X)) = \sigma^2$. Hence when the tuple $(H, \beta)$ is inferred from data,
then $\beta$, which determines $\sigma^2$, can be seen as an estimate of the expected squared error
of $H$. ◊
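As a rough illustration of the tuple $(H, \sigma^2)$ in the example above, the sketch below fits a first-degree polynomial by ordinary least squares and then computes the maximum likelihood value of $\beta$ for that fixed $H$, using $\sigma^2 = 1/(2\beta)$. The data values and function names are hypothetical; least squares stands in for whatever inference method the company might use.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def ml_beta(xs, ys, a, b):
    """For fixed H, the ML variance is the mean squared error and beta = 1 / (2 sigma^2)."""
    sigma2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return 1.0 / (2.0 * sigma2), sigma2

def log_likelihood(xs, ys, a, b, beta):
    """Log-likelihood under the density sqrt(beta / pi) * exp(-beta * (y - H(x))**2)."""
    sq_err = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 0.5 * len(xs) * math.log(beta / math.pi) - beta * sq_err

xs = [0.0, 1.0, 2.0, 3.0, 4.0]            # hypothetical training inputs
ys = [0.1, 0.9, 2.2, 2.8, 4.1]            # hypothetical training outputs
a, b = fit_line(xs, ys)
beta, sigma2 = ml_beta(xs, ys, a, b)
```

The pair `(a, b)` plays the role of $H$, and `sigma2` is the error estimate that the client would receive alongside it.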


Example  6.2  (Concept  learning  and  Bernoulli  parameters) 

Let $\mathcal{M}$ be a class of concepts over $E = E_x \times \{0, 1\}$ and let $\mathrm{ER}_{01}$ be the 0/1-error
function (see (6.2)). Let the observed data be $D = (x^n, y^n)$. Let $\langle \mathcal{M} \rangle_{\mathrm{ER}_{01}}$ be the entropification
of $\mathcal{M}$ with respect to error function $\mathrm{ER}_{01}$, and let $\langle H \rangle_{\mathrm{ER}_{01}} = \{ P(\cdot|H, \beta) \mid \beta \in \Gamma_{nat}(H) \}$
be the restriction of $\langle \mathcal{M} \rangle_{\mathrm{ER}_{01}}$ to models with fixed $H \in \mathcal{M}$. Substituting
$\beta = \ln(1 - \theta) - \ln(\theta)$ in definition 6.5, we find that $\langle H \rangle_{\mathrm{ER}_{01}}$ is the class of Bernoulli
models, containing one model for each probability of error

$E_\beta(\mathrm{ER}_{01}(Y|H, X)) = P_\beta\{\mathrm{ER}_{01}(Y|H, X) = 1\} = P_\beta\{H(X) \neq Y\} = \frac{1}{Z(\beta)} \exp(-\beta \cdot 1) = \theta$,  (6.11)

and

$E_\beta(|1 - \mathrm{ER}_{01}(Y|H, X)|) = P_\beta\{\mathrm{ER}_{01}(Y|H, X) = 0\} = P_\beta\{H(X) = Y\} = \frac{1}{Z(\beta)} \exp(-\beta \cdot 0) = 1 - \theta$.  (6.12)

It follows that the class $\langle \mathcal{M} \rangle_{\mathrm{ER}_{01}}$ can equivalently be parameterized as $\langle \mathcal{M} \rangle_{\mathrm{ER}_{01}} = \{ P(\cdot|H, \theta, \cdot) \mid H \in \mathcal{M},\ 0 < \theta < 1 \}$,
such that, if $\mathrm{ER}_{01}(y^n|H, x^n) = k$, then

$P(y^n|H, \theta, x^n) = \theta^k (1 - \theta)^{n-k}$.  (6.13)

This expresses that the probability of error of $H$ is equal to $\theta$ for each observation,
independently of any other observations. Equation (6.11) shows that, if $(H, \beta)$ is inferred
from data $D$, then $\beta$ (which determines $\theta$) can be interpreted as an estimate of
the expected 0/1-error of $H$, which is just the probability that $H$ misclassifies $D$. Just
as above, $\beta$ serves to estimate the expected (in this case, 0/1-) error. ◊

MDL is usually applied to concept classes in a way that does not involve entropification
[83]. In example 6.5, we show that the ‘traditional’ way of applying MDL to a concept
class $\mathcal{M}$ is essentially equivalent to applying MDL to the probabilistic class $\langle \mathcal{M} \rangle_{\mathrm{ER}}$,
thus reconciling the two views.

Example  6.3  (Entropification  of  probabilistic  model  classes) 

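What happens if we try to entropify a probabilistic model class $\mathcal{M}$? For simplicity, we
only consider the case where $\mathcal{M} = \{ P(\cdot|\eta) \mid \eta \in \Gamma_{\mathcal{M}} \}$ is a class of i.i.d. probabilistic
models over $E_y$. Similarly, we consider only error functions $\mathrm{ER} : E_y \times \mathcal{M} \to \mathbb{R}$.
The values of $x_i$ are therefore irrelevant and $\mathcal{M}$ consists of full rather than conditional
probability distributions. Definition 6.5 is seen to simplify in this case to

$\langle \mathcal{M} \rangle_{\mathrm{ER}} = \{ P(\cdot|(\eta, \beta));\ \eta \in \Gamma_{\mathcal{M}};\ \beta \in \Gamma_{nat}(\eta) \}$, where

$P(y|(\eta, \beta)) = \frac{1}{Z_\eta(\beta)} \exp(-\beta\, \mathrm{ER}(y|\eta))$,

$P(y^n|(\eta, \beta)) = \prod_{i=1}^{n} P(y_i|(\eta, \beta))$.  (6.14)

A natural error function for probabilistic models is the logarithmic error $\mathrm{ER}_{lg}(y^n|\eta) = -\sum_{i=1}^{n} \ln(P(y_i|\eta))$.
Using this logarithmic error, we obtain

$P(y|\eta, \beta) = \frac{1}{Z_\eta(\beta)} \exp(\beta \ln(P(y|\eta))) = \frac{1}{Z_\eta(\beta)} P(y|\eta)^\beta$.  (6.15)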


We consider two cases:

1. $\mathcal{M}$ is an exponential family. A $k$-parameter exponential family is a family of
probability distributions or densities that can be written in the form

$p(x|\beta) = \frac{1}{Z(\beta)} \exp(-\beta^T \phi(x)) h(x)$,  (6.16)

where $Z(\beta) = \sum_{x \in E} \exp(-\beta^T \phi(x)) h(x)$, $\phi(x) = (\phi_1(x), \dots, \phi_k(x))$, and
$\beta \in \mathbb{R}^k$. The $\phi_i(x)$ are functions defined for all $x \in E$. The natural parameter space
of an exponential family is given by

$\Gamma_{nat} = \{ \beta \in \mathbb{R}^k \mid Z(\beta) < \infty \}$.

An exponential family is said to be full if it contains a model for every $\beta \in \Gamma_{nat}$.
The dimension of an exponential family is the dimension of its associated
$\Gamma_{nat}$. An exponential family is said to be of irreducible dimension if there is no
$(k - 1)$-parameter exponential family expressing the same class of probability
distributions. From the point of view of measure theory, the function $h(x)$ may
be absorbed in a dominating measure [63], so one can drop the factor $h(x)$ from
(6.16). We see that maximum entropy model classes and exponential model
classes coincide. If $\mathcal{M}$ is a full exponential family then $\langle \mathcal{M} \rangle_{\mathrm{ER}_{lg}} = \mathcal{M}$, as
can be seen from substituting (6.16) in (6.15). If $\mathcal{M}$ is an exponential family
that is not full and that contains a model for some $\beta \neq 0$, then entropification
serves to make it full. We see that full exponential families, and hence ‘full’
maximum entropy model classes, are closed under entropification.

2. $\mathcal{M}$ is not an exponential family. This case is more interesting. Many useful
probabilistic model classes are not of the exponential form; as a simple example
consider hidden Markov models [60]. For such model classes, entropification
can nevertheless be useful, for two reasons: (1) it leads to reliable
estimates of the logarithmic error in the sense of definition 6.2, and (2) using
$\langle \mathcal{M} \rangle_{\mathrm{ER}_{lg}}$ instead of $\mathcal{M}$ can often lead to additional compression of the data
when data is encoded using the MDL two-part code. For a discussion of the MDL
two-part code see for example [92]. ◊
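Equation (6.15) says that entropifying with the logarithmic error raises each model probability to the power $\beta$ and renormalizes. The sketch below applies this to a single Bernoulli model as a small, hypothetical illustration of the closure property in case 1; the function name and probabilities are assumptions.

```python
import math

def entropify_log_error(model, beta):
    """Apply (6.15) on a finite outcome set: P(y | eta, beta) is proportional to P(y | eta) ** beta."""
    weights = [p ** beta for p in model]
    z = sum(weights)                      # Z_eta(beta)
    return [w / z for w in weights]

bernoulli = [0.3, 0.7]                    # hypothetical P(y = 0), P(y = 1)
same = entropify_log_error(bernoulli, 1.0)     # beta = 1 returns the original model
flatter = entropify_log_error(bernoulli, 0.0)  # beta = 0 gives the uniform distribution
sharper = entropify_log_error(bernoulli, 3.0)  # beta > 1 concentrates on the likelier outcome
```

Every result is again a distribution over $\{0, 1\}$, that is, another member of the (full) Bernoulli family, so no new models are created in this case.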

We continue by presenting some useful properties of entropified model classes $\langle \mathcal{M} \rangle_{\mathrm{ER}}$.
These properties will be used in the proofs of our main results. The key to proving all
the properties is that for each fixed $H \in \mathcal{M}$, the subclass of models $\langle H \rangle_{\mathrm{ER}}$ containing
$(H, \beta)$ for all $\beta \in \Gamma_{nat}$ (i.e., $\langle H \rangle_{\mathrm{ER}} = \{ (H, \beta) \mid (H, \beta) \in \langle \mathcal{M} \rangle_{\mathrm{ER}} \}$) is essentially
(though not strictly) a maximum entropy model class. The reason that the correspondence
is not strict is that $\langle H \rangle_{\mathrm{ER}}$ is a class of conditional models. This leads to some
technical, but not essential, complications in proving the properties.

We assume that we are given a sample space $E = E_x \times E_y$, a model class $\mathcal{M}$ and an
error function ER. Briefly, we will show that:

1. even though the distributions indexed by $(H, \beta)$ are conditional, one can define
entropy and expectation of error with respect to these distributions,

2. entropification leads to ‘reliable’ estimates of the error (in the sense of definition
6.2),

3. for fixed $H$, the models $(H, \beta)$ are all, in an important sense, equivalent, but
they differ in that they all have different entropy, and

4. there exists a particularly well-behaved class of error functions which we will
call ‘simple’.

In proving them, we need to use the maximum likelihood estimator for fixed hypothesis
$H$ and, similarly, for fixed $\beta$, which we now define:

Definition  6.6 

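Let $D = (x^n, y^n) \in E^n$. The maximum likelihood estimator of $D$ for fixed $H$ with
respect to $\langle \mathcal{M} \rangle_{\mathrm{ER}}$, denoted by $\hat{\beta}(D|H)$, is (if it exists) given by

$\hat{\beta}(D|H) = \arg\max_{\beta \in \Gamma_{nat}(H)} \{ P(y^n|H, \beta, x^n) \}$.  (6.17)

The maximum likelihood estimator of $D$ for fixed $\beta$ with respect to $\langle \mathcal{M} \rangle_{\mathrm{ER}}$, denoted
by $\hat{H}(D|\beta)$, is (if it exists) given by

$\hat{H}(D|\beta) = \arg\max_{H \in \mathcal{M}} \{ P(y^n|H, \beta, x^n) \}$.  (6.18)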


We assume that $Z_H(\beta)$ does not depend on $x$. As will be shown, this implies that for
fixed $H$ and $\beta$, the expectation of the error under $(H, \beta)$ is independent of the given $x$.
The same holds for the entropy. Formally, we let $E_{(H,\beta)}$ denote expectation under the
model $P(\cdot|H, \beta, \cdot) \in \langle \mathcal{M} \rangle_{\mathrm{ER}}$. Then for all $(H, \beta) \in \langle \mathcal{M} \rangle_{\mathrm{ER}}$ and $x_1, x_2 \in E_x$, we have

$E_{(H,\beta)}(\mathrm{ER}(Y|H, X) \mid X = x_1) = E_{(H,\beta)}(\mathrm{ER}(Y|H, X) \mid X = x_2)$.  (6.19)

Also, for all $x_1, x_2 \in E_x$, the entropy $\mathcal{H}(P(\cdot|H, \beta, \cdot))$ satisfies

$\mathcal{H}(P(\cdot|H, \beta, x_1)) = \mathcal{H}(P(\cdot|H, \beta, x_2))$.  (6.20)

Equations (6.19) and (6.20) imply that the expectation $E_{(H,\beta)}(\mathrm{ER}(Y|H, X))$ over the conditional model $(H, \beta)$
supplied with an arbitrary distribution $P_x$ over $E_x$ does not depend on $P_x$. This allows
us to write $E_{(H,\beta)}(\mathrm{ER}(Y|H, X))$ instead of $E_{(H,\beta)}(\mathrm{ER}(Y|H, X) \mid X = x)$. Similarly,
we will write $\mathcal{H}(H, \beta)$ instead of $\mathcal{H}(P(\cdot|H, \beta, x))$.

The  following  proposition  lists  some  very  useful  (and  well-known)  facts  about  maxi¬ 
mum  entropy/exponential  model  classes  that  will  be  used  several  times. 

Proposition  6.1 

Let $\mathcal{M}_{me}$ be a maximum entropy class for the function $\phi(x) = (\phi_1(x), \dots, \phi_m(x))$
with range $U = U_1 \times \dots \times U_m$. Let $\beta = (\beta_1, \dots, \beta_m) \in \Gamma_{nat}$, where $\Gamma_{nat}$ is the
space of parameters in the natural parameterization of $\mathcal{M}_{me}$. Let $1 \leq i, j \leq m$. Then,
we get

1. The first two (central) moments of $P(\cdot|\beta)$ are determined by the first two
derivatives of $Z(\beta)$

$\frac{\partial}{\partial \beta_i} \ln(Z(\beta)) = -E_\beta(\phi_i(X))$,

$\frac{\partial^2}{\partial \beta_i \partial \beta_j} \ln(Z(\beta)) = \mathrm{cov}(\phi_i(X), \phi_j(X)) = E((\phi_i(X) - E(\phi_i(X)))(\phi_j(X) - E(\phi_j(X))))$.

2. Let $\beta_1, \dots, \beta_m$ all be fixed except $\beta_i$. Then

(a) $E_\beta(\phi_i(X))$ as a function of $\beta_i$ is strictly decreasing.

(b) If $\beta_i > 0$ then the entropy $\mathcal{H}(\beta_i)$ is a strictly decreasing function of $\beta_i$. If
$\beta_i < 0$, then $\mathcal{H}(\beta_i)$ is a strictly increasing function of $\beta_i$.

3. The log-likelihood $\ln(P(x^n|\beta))$ as a function of $\beta_i$ is concave, reaching its
maximum at the point where $E_\beta(\phi_i(X)) = \bar{\phi}_i(x^n)$, the average of $\phi_i$ over $x^n$. More generally

4. Let $E_\theta$ stand for the expectation under the model $P(\cdot|\theta) \in \mathcal{M}_{me}$ defined by
the mean-value parameterization (i.e. $E_\theta(\phi(X)) = \theta$). Assume that $\bar{\phi}(x^n)$
lies in the interior of $U$. Then

$E_{\hat{\beta}(x^n)}(\phi(X)) = E_{\hat{\theta}(x^n)}(\phi(X)) = \hat{\theta}(x^n) = \bar{\phi}(x^n)$.  (6.21)

5. $\theta(\beta) = E_\beta(\phi(X))$ as a function of $\beta$ is a continuous bijection from $\Gamma_{nat}$ to
int($U$).

Proof proposition 6.1: All of these properties are straightforward to verify by differentiation,
realizing that when we take derivatives of $Z(\beta)$ we are allowed to interchange
the order of differentiation and integration by our regularity conditions on $\phi$.
Otherwise, see [63]. □
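Item 1 of proposition 6.1 can be checked numerically for a one-parameter family on a finite sample space. The sketch below compares finite-difference derivatives of $\ln Z(\beta)$ with the mean and variance of $\phi$; the values of $\phi$, the point $\beta$ and the step size are arbitrary assumptions.

```python
import math

def log_z(phi, beta):
    """ln Z(beta) for a one-parameter family on a finite sample space."""
    return math.log(sum(math.exp(-beta * v) for v in phi))

def moments(phi, beta):
    """Mean and variance of phi(X) under P(x | beta) proportional to exp(-beta * phi(x))."""
    weights = [math.exp(-beta * v) for v in phi]
    z = sum(weights)
    mean = sum(w * v for w, v in zip(weights, phi)) / z
    var = sum(w * (v - mean) ** 2 for w, v in zip(weights, phi)) / z
    return mean, var

phi = [0.0, 1.0, 2.0, 5.0]                # hypothetical values of phi on four outcomes
beta, h = 0.4, 1e-4
mean, var = moments(phi, beta)

# Central finite differences for the first and second derivative of ln Z.
first = (log_z(phi, beta + h) - log_z(phi, beta - h)) / (2 * h)
second = (log_z(phi, beta + h) - 2 * log_z(phi, beta) + log_z(phi, beta - h)) / h ** 2
```

Up to discretization error, `first` equals `-mean` and `second` equals `var`. Since the variance is positive, this also illustrates why the expected value of $\phi$ decreases strictly in $\beta$ (item 2(a)).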

Let $Q$ and $R$ be two distributions over $E$ satisfying $E(\phi(X)) = t$. Let $P_{me}$ be the
maximum entropy distribution for this constraint. We have

$\mathcal{H}(Q) \stackrel{(1)}{=} E_Q(-\ln(Q(X))) \stackrel{(2)}{\leq} E_Q(-\ln(P_{me}(X))) \stackrel{(3)}{=} E_{me}(-\ln(P_{me}(X))) \stackrel{(4)}{=} \mathcal{H}(P_{me}) \stackrel{(5)}{\leq} E_{me}(-\ln(R(X)))$.  (6.22)

If $Q \neq P_{me}$, inequality (2) becomes strict; if $R \neq P_{me}$, inequality (5) becomes strict.

Proof (6.22): (1) and (4) follow from the definition of entropy, which is given by

$\mathcal{H}(P) = E_P(-\ln(P(X)))$.  (6.23)

(2) and (5) follow from the information inequality, which is given by

$D(P\|Q) \geq 0$.  (6.24)

To see that $P_{me}$ indeed maximizes the entropy subject to the constraint $E(\phi(X)) = t$, let
$Q$ be any distribution other than $P_{me}$ satisfying the constraint and notice that

$\mathcal{H}(Q) = E_Q(-\ln(Q(X))) < E_Q(-\ln(P_{me}(X)))$
$= E_Q(\beta^T \phi(X) + \ln(Z(\beta))) = \beta^T E_Q(\phi(X)) + \ln(Z(\beta))$
$\stackrel{(6)}{=} \beta^T t + \ln(Z(\beta))$,  (6.25)

where the inequality follows from the information inequality (6.24) and (6) follows
from the fact that we defined $Q$ to satisfy the constraint and hence $E_Q(\phi(X)) = t$. On
the other hand, we see

$\mathcal{H}(P_{me}) = E_{me}(-\ln(P_{me}(X)))$
$= E_{me}(\beta^T \phi(X) + \ln(Z(\beta)))$
$= \beta^T E_{me}(\phi(X)) + \ln(Z(\beta))$
$\stackrel{(7)}{=} \beta^T t + \ln(Z(\beta))$,  (6.26)

where (7) follows from the fact that, by definition, the maximum entropy distribution
satisfies the constraint and hence $E_{me}(\phi(X)) = t$. Together, (6.25) and (6.26) give
that $\mathcal{H}(P_{me}) > \mathcal{H}(Q)$ for all $Q \neq P_{me}$ satisfying the constraint. Now (3) follows
from equations (6.25) and (6.26). (1) - (5) imply $\mathcal{H}(Q) \leq \mathcal{H}(P_{me})$, which expresses
the fact that $P_{me}$ maximizes the entropy. They also imply that $E_Q(-\ln(P_{me}(X))) \leq E_{me}(-\ln(R(X)))$,
which expresses the fact that $P_{me}$ minimizes the worst-case description
length. □
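The chain (6.22) can be illustrated on a three-point sample space. In the sketch below, a competing distribution $Q$ with the same mean of $\phi$ is built by shifting mass along a direction that preserves both the total probability and the mean; all numbers and names are hypothetical.

```python
import math

def entropy(p):
    """Entropy in nats of a distribution on a finite set."""
    return -sum(q * math.log(q) for q in p if q > 0)

def cross_entropy(q, p):
    """E_Q(-ln P(X)) for distributions q and p on the same finite set."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p))

def max_ent(phi, beta):
    """Maximum entropy distribution proportional to exp(-beta * phi(x))."""
    weights = [math.exp(-beta * v) for v in phi]
    z = sum(weights)
    return [w / z for w in weights]

phi = [0.0, 1.0, 2.0]
p_me = max_ent(phi, 0.5)
t = sum(p * v for p, v in zip(p_me, phi))          # the constraint value E(phi(X)) = t

# Moving mass along (+eps, -2*eps, +eps) keeps both the total mass and the mean fixed.
eps = 0.05
q = [p_me[0] + eps, p_me[1] - 2 * eps, p_me[2] + eps]
t_q = sum(p * v for p, v in zip(q, phi))
```

With these values, `entropy(q)` is strictly smaller than `entropy(p_me)`, while `cross_entropy(q, p_me)` equals `entropy(p_me)`, matching steps (2) and (3) of (6.22).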

Let $\mathcal{M}$ be a class of models, ER be an error function and $\langle \mathcal{M} \rangle_{\mathrm{ER}}$ be the entropification
of $\mathcal{M}$. $\langle H \rangle_{\mathrm{ER}}$, the subclass of models from $\langle \mathcal{M} \rangle_{\mathrm{ER}}$ restricted to fixed $H$ (i.e. $\langle H \rangle_{\mathrm{ER}} = \{ (H, \beta) \mid (H, \beta) \in \langle \mathcal{M} \rangle_{\mathrm{ER}} \}$),
is essentially a maximum entropy model class. However,
since $\langle H \rangle_{\mathrm{ER}}$ is a class of conditional models, we need to use a trick: we extend $\langle \mathcal{M} \rangle_{\mathrm{ER}}$
to a class of distributions over $E_x \times E_y$ by supplying it with the uniform distribution
over $E_x$; this distribution exists since we assume $E_x$ to be either finite or compact.
As will be shown below, the resulting model class, which we denote by $\langle \mathcal{M} \rangle^u_{\mathrm{ER}}$, is a
maximum entropy model class. We then use standard results about such classes to prove
certain properties for $\langle \mathcal{M} \rangle^u_{\mathrm{ER}}$, and we then show that if these properties hold for $\langle \mathcal{M} \rangle^u_{\mathrm{ER}}$,
they must also hold for the class of conditional distributions $\langle \mathcal{M} \rangle_{\mathrm{ER}}$. This will be done
in lemma 6.1 below. After having proved the lemma, we will show that some properties
follow as immediate corollaries from this lemma.

Lemma  6.1 

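Let $E = E_x \times E_y$ and let $\mathrm{ER} : E \times \mathcal{M} \to \mathbb{R} \cup \{\infty\}$ be an error function. Let $P_u(\cdot)$
be the uniform distribution over $E_x$. Let $\langle \mathcal{M} \rangle^u_{\mathrm{ER}}$ be the class of probabilistic models
$P_u(\cdot, \cdot|H, \beta)$ where for each $H$ in $\mathcal{M}$ and each $(x^n, y^n)$, $P_u(x^n, y^n|H, \beta) = P(y^n|H, \beta, x^n) P_u(x^n)$. We have

1. There exists a constant $c \in \mathbb{R}$ such that for all $n$, all $(x^n, y^n) \in E^n$

$P(y^n|\beta, H, x^n) \cdot c^n = P_u(x^n, y^n|\beta, H)$.  (6.27)

Let, for fixed $H \in \mathcal{M}$, $\langle H \rangle^u_{\mathrm{ER}} = \{ (H, \beta) \mid (H, \beta) \in \langle \mathcal{M} \rangle^u_{\mathrm{ER}} \}$ be the restriction of
$\langle \mathcal{M} \rangle^u_{\mathrm{ER}}$ to models with fixed $H$.

2. $\langle H \rangle^u_{\mathrm{ER}}$ is the maximum entropy model class for function $\phi(x, y) = \mathrm{ER}(y|H, x)$
with range $U$. Here $U$ is the smallest (open or closed) interval in $\mathbb{R}$ such that
$\forall (x, y) \in E$: $\phi(x, y) \in U$.

Let $x$ be an arbitrary element of $E_x$. Let $P(\cdot|H, \beta, x)$ be the distribution over $E_y$ given
by $P(y|H, \beta, x) = Z_H(\beta)^{-1} \exp(-\beta\, \mathrm{ER}(y|H, x))$ and let $\mathcal{H}(P(\cdot|H, \beta, x))$ stand for
the entropy of the distribution $P(\cdot|H, \beta, x)$.

3. We have for all $(H, \beta)$

$E_{P_u(\cdot,\cdot|H,\beta)}(\mathrm{ER}(Y|H, X)) = E_{P(\cdot|H,\beta,x)}(\mathrm{ER}(Y|H, X) \mid X = x) = -\frac{\partial}{\partial \beta} \ln(Z_H(\beta))$.  (6.28)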

4. We also have for all $(H, \beta)$

$\mathcal{H}(P(\cdot|H, \beta, x)) = \mathcal{H}(P_u(\cdot, \cdot|H, \beta)) + \ln(c)$.  (6.29)

Proof lemma 6.1: Item 1 is straightforward. Item 2 follows directly by our assumptions
on ER and the definition of $\langle \mathcal{M} \rangle_{\mathrm{ER}}$ (definition 6.5). To prove item 3, note that for each
$P(\cdot|H, \beta, \cdot)$ the corresponding unconditional model $P_u(\cdot, \cdot|H, \beta)$ is given by

$P_u(x, y|H, \beta) = \frac{c}{Z_H(\beta)} \exp(-\beta\, \mathrm{ER}(y|H, x))$.

Since $\langle \mathcal{M} \rangle^u_{\mathrm{ER}}$ is a maximum entropy model class, we have

$E_{P_u(\cdot,\cdot|H,\beta)}(\mathrm{ER}(Y|H, X)) = -\frac{\partial}{\partial \beta} (\ln(Z_H(\beta)) - \ln(c)) = -\frac{\partial}{\partial \beta} \ln(Z_H(\beta))$.  (6.30)

Now choose an arbitrary $x \in E_x$. Let $\langle H(x) \rangle_{\mathrm{ER}}$ be the class of models containing,
for each $\beta \in \Gamma_{nat}(H)$, the distribution $P(\cdot|H, \beta, x)$ defined as in the statement of
item 3 in the lemma. $\langle H(x) \rangle_{\mathrm{ER}}$ is a class of maximum entropy models for function
$\phi_x(y) = \mathrm{ER}(y|H, x)$ (note that $\phi_x$ is a function of $y$ only; $x$ is kept fixed). This is
straightforward to verify by our assumptions on ER. We now have

$E_{P(\cdot|H,\beta,x)}(\mathrm{ER}(Y|H, X) \mid X = x) = -\frac{\partial}{\partial \beta} \ln(Z_H(\beta))$.  (6.31)

(6.30) and (6.31) coincide. Since we picked $x$ arbitrarily, (6.28) follows. We continue
with item 4. By straightforward calculation we see that the entropy of $P(\cdot|H, \beta, x)$ is
equal to

$\beta E_{P(\cdot|H,\beta,x)}(\mathrm{ER}(Y|H, X) \mid X = x) + \ln(Z_H(\beta))$,

while the entropy of $P_u(\cdot, \cdot|H, \beta)$ is given by

$\beta E_{P_u(\cdot,\cdot|H,\beta)}(\mathrm{ER}(Y|H, X)) + \ln(Z_H(\beta)) - \ln(c)$.

Together with (6.28) in item 3 of the lemma, equality (6.29) follows. □

We  proceed  to  show  that  entropification  leads  to  reliable  estimates  of  the  error.  This 
will  be  the  key  to  proving  the  theorems  on  entropification  which  we  prove  below.  In 
example  6.1  we  discussed  why  ‘reliability’  is  a  desirable  property. 

Proposition  6.2  (reliability) 

Let $D = (x^n, y^n)$. $E_{(H,\beta)}(\mathrm{ER}(Y|H, X))$ is, as a function of $\beta$, continuous for each $H \in \mathcal{M}$;
moreover

$E_{(H,\hat{\beta}(D|H))}(\mathrm{ER}(Y|H, X)) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{ER}(y_i|H, x_i)$.  (6.32)

This proposition shows that, for each model $(H, \hat{\beta}(D|H))$, its expected error over
future data is equal to its average error over the given data. By definition 6.2, this
implies that for each $H \in \mathcal{M}$, the average error $\mathrm{ER}(y|H, x)$ can be reliably estimated
on the basis of the restriction of the class $\langle \mathcal{M} \rangle_{\mathrm{ER}}$ to models containing this specific $H$.

Proof proposition 6.2: Let $H \in \mathcal{M}$ be fixed. By lemma 6.1, item 1, the probabilistic
model $P_u(\cdot, \cdot|H, \beta)$ that maximizes, for fixed $H$, the likelihood of $D$ within the class
of unconditional models $\langle H \rangle^u_{\mathrm{ER}}$ (as defined in lemma 6.1) is indexed by the same value
$\beta$ as the model $P(\cdot|H, \beta, \cdot)$ that maximizes the likelihood of $D$ within the class of
conditional models $\langle H \rangle_{\mathrm{ER}}$. Also by lemma 6.1, $\langle H \rangle^u_{\mathrm{ER}}$ is a maximum entropy model
class. Therefore, we have that the expectation under the unconditional model indexed
by $H$ and $\hat{\beta}(D|H)$ is equal to the average over data $D$

$E_{P_u(\cdot,\cdot|H,\hat{\beta}(D|H))}(\mathrm{ER}(Y|H, X)) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{ER}(y_i|H, x_i)$.  (6.33)

By lemma 6.1, item 3, this shows (6.32). □
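The reliability property (6.32) can be reproduced on a small example. The sketch below finds, by bisection, the $\beta$ whose expected error matches the empirical average error, and the log-likelihood is then compared at nearby values. The error values, the sample and the search bounds are hypothetical, and the sketch covers only the unconditional case with a finite outcome set.

```python
import math

def expected_error(errors, beta):
    """E_(H, beta)(ER) for a finite outcome set with the given error values."""
    weights = [math.exp(-beta * e) for e in errors]
    z = sum(weights)
    return sum(w * e for w, e in zip(weights, errors)) / z

def log_likelihood(sample_errors, errors, beta):
    """ln P(y^n | H, beta) = -beta * (sum of errors) - n * ln Z_H(beta)."""
    log_z = math.log(sum(math.exp(-beta * e) for e in errors))
    return -beta * sum(sample_errors) - len(sample_errors) * log_z

def ml_beta(sample_errors, errors, lo=-50.0, hi=50.0):
    """Bisection on beta, using that the expected error is strictly decreasing in beta."""
    target = sum(sample_errors) / len(sample_errors)
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if expected_error(errors, mid) > target:
            lo = mid          # expected error still too large: increase beta
        else:
            hi = mid
    return (lo + hi) / 2.0

errors = [0.0, 1.0, 2.0]                          # hypothetical ER(y | H) per outcome
sample_errors = [0.0, 0.0, 1.0, 2.0, 0.0, 1.0]    # errors observed on hypothetical data
beta_hat = ml_beta(sample_errors, errors)
empirical_average = sum(sample_errors) / len(sample_errors)
```

At `beta_hat` the expected error equals `empirical_average`, and the log-likelihood is lower at neighbouring values of $\beta$, as the concavity in proposition 6.1, item 3 suggests.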

We now state two properties that explain why we have chosen the name ‘entropification’.
Let $H \in \mathcal{M}$ be arbitrary but fixed. The models $(H, \beta)$ in $\langle \mathcal{M} \rangle_{\mathrm{ER}}$ are, for all
$\beta \in \Gamma_{nat}(H)$ except $\beta = 0$, partially equivalent to $H$ as stand-alone in the sense
that they leave the ordering (in terms of goodness-of-fit) that they impose on the data
unchanged. The ordering with respect to the original error function equals the new ordering
with respect to the logarithmic error. Yet the models $(H, \beta)$ are all different in
the sense that they all have different entropies.

Proposition 6.3

Let $H \in \mathcal{M}$ and let $x^n$, $y^n$ and $z^n$ be such that $\mathrm{ER}(y^n|H, x^n) > \mathrm{ER}(z^n|H, x^n)$. Then
for all $\beta \in \Gamma_{nat}(H)$ with $\beta > 0$

$-\ln(P(y^n|H, \beta, x^n)) > -\ln(P(z^n|H, \beta, x^n))$,

while for all $\beta < 0$

$-\ln(P(y^n|H, \beta, x^n)) < -\ln(P(z^n|H, \beta, x^n))$.

Proof proposition 6.3: Immediate from definition 6.5. □

Hence for each $H$, entropification either leaves unchanged or reverses the ordering in
terms of goodness-of-fit that $H$ imposes on the data: for every $\beta$, the ordering with
respect to $\mathrm{ER}(\cdot|H, x^n)$ is identical or reversed to the ordering with respect to the code
length (or ‘logarithmic error’) $-\ln(P(\cdot|H, \beta))$. For the second property, let $\mathcal{H}(H, \beta)$
denote the entropy of the model $P(\cdot|H, \beta, x)$ (for arbitrary $x$) restricted to single outcomes
in $E_y$.

Proposition  6.4 

For all $\beta \in \Gamma_{nat}(H)$ with $\beta > 0$, the entropy $\mathcal{H}(H, \beta)$ is a strictly decreasing function
of $\beta$. For all $\beta < 0$, $\mathcal{H}(H, \beta)$ is a strictly increasing function of $\beta$.

Propositions 6.3 and 6.4 tell us that for fixed $H$ and varying $\beta$, the compound models
$(H, \beta)$ can all be seen as ‘versions’ of $H$ with different entropies.

Proof proposition 6.4: We know from lemma 6.1, item 4 that $\mathcal{H}(H, \beta) = \ln(c) + \mathcal{H}(P_u(\cdot, \cdot|H, \beta))$
for some constant $c$. The second term stands for the entropy of a
maximum entropy distribution with parameter $\beta$ (lemma 6.1, item 2). Since for maximum
entropy distributions, the entropy is a strictly decreasing (increasing) function
of $\beta$ for $\beta > 0$ ($\beta < 0$), the result follows (proposition 6.1, item 2(b)). □

Suppose $\beta > 0$*. As $\beta$ increases, the entropy $\mathcal{H}(H, \beta)$ decreases. One may expect that,
with decreasing entropy (and thus decreasing ‘inherent disorder’), the expected error
$E_{(H,\beta)}(\mathrm{ER}(Y|H, X))$ also decreases. This relation indeed holds, but only if $\beta > 0$: if
$\beta < 0$, then the entropy is an increasing function of $\beta$ while the expected error remains
a decreasing function of $\beta$. In general, let, for fixed $H$, $U_H$ be the smallest (possibly
unbounded) interval in $\mathbb{R}$ such that $\forall (x, y) \in E$: $\mathrm{ER}(y|H, x) \in U_H$.

Proposition 6.5

$E_{(H,\beta)}(\mathrm{ER}(Y|H, X))$ is a strictly decreasing function of $\beta$. For each $t$ in the interior
of $U_H$ there exists a unique value of $\beta$ such that $E_{(H,\beta)}(\mathrm{ER}(Y|H, X)) = t$.

Proof proposition 6.5: We know from lemma 6.1, item 3 that for all $\beta \in \Gamma_{nat}(H)$,
$E_{(H,\beta)}(\mathrm{ER}(Y|H, X)) = E_{P_u(\cdot,\cdot|H,\beta)}(\mathrm{ER}(Y|H, X))$. Here, $P_u(\cdot, \cdot|H, \beta)$ (see lemma
6.1, item 2) is a maximum entropy distribution with natural parameter $\beta$. By applying
proposition 6.1, item 2(a), which states that $E_{P_u(\cdot,\cdot|H,\beta)}(\mathrm{ER}(Y|H, X))$ is a strictly
decreasing function of $\beta$, the first part follows. The second part immediately follows
from item 5 of proposition 6.1. □
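Propositions 6.4 and 6.5 can be observed together on a small finite example: the expected error falls as $\beta$ grows, while the entropy peaks at $\beta = 0$. The sketch below uses hypothetical error values and an arbitrary grid of $\beta$ values.

```python
import math

def model(errors, beta):
    """P(y | H, beta) over a finite outcome set with the given error values."""
    weights = [math.exp(-beta * e) for e in errors]
    z = sum(weights)
    return [w / z for w in weights]

def expected_error(errors, beta):
    return sum(p * e for p, e in zip(model(errors, beta), errors))

def entropy(errors, beta):
    return -sum(p * math.log(p) for p in model(errors, beta) if p > 0)

errors = [0.0, 1.0, 3.0]                          # hypothetical ER(y | H) per outcome
betas = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]    # increasing grid, with beta = 0 in the middle
exp_errors = [expected_error(errors, b) for b in betas]
entropies = [entropy(errors, b) for b in betas]
```

On this grid `exp_errors` is strictly decreasing throughout, whereas `entropies` increases up to $\beta = 0$ (the uniform distribution) and decreases afterwards.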

Some error functions, among which the squared error and the 0/1-error, turn out to
have a useful additional property which automatically makes them satisfy our regularity
conditions for error functions and which makes sure that entropification leaves the
relative ordering of hypotheses in terms of goodness-of-fit for given data unchanged.
We call such error functions simple:

Definition 6.7

If ER is such that for all $H_1, H_2 \in \mathcal{M}$ and all $\beta$,

$Z_{H_1}(\beta) = Z_{H_2}(\beta)$,

where $Z_H(\beta)$ is defined as in (6.6), then we call ER a simple error function for $\mathcal{M}$.


The  two  error  functions  we  have  encountered  earlier  are  both  simple,  as  shown  by  the 
following  proposition. 

Proposition  6.6 

Let $E = E_x \times E_y$, let $\mathcal{M}$ be an arbitrary class of functions $H : E_x \to E_y$. If $E_y = \mathbb{R}$
then the squared error function $\mathrm{ER}_{sq}$ is simple for $\mathcal{M}$. If $E_y = \{0, 1\}$, then the 0/1-error
function $\mathrm{ER}_{01}$ is simple for $\mathcal{M}$.

Proof proposition 6.6: In the squared error case, for $\beta > 0$,

$Z_{H,x}(\beta) = \int_{y \in E_y} \exp(-\beta\, \mathrm{ER}_{sq}(y|H, x))\, dy$
$= \int_{y \in E_y} \exp(-\beta (y - H(x))^2)\, dy$
$= \sqrt{\pi / \beta}$,

which does not depend on either $H$ or $x$. The case of the 0/1-error is analogous. □

* Note that this is always the case in statistical mechanics.

For model classes entropified with simple error functions we can drop the subscript
from $Z_H(\beta)$ and simply write $Z(\beta)$. By calculating the entropy $\mathcal{H}(H, \beta)$ of the model
$(H, \beta)$ it follows immediately that this entropy depends only on $\beta$ and not on $H$. Simple
error functions have an additional important property which is dual to the property
expressed by proposition 6.3. Whereas in that proposition, we showed that the ordering
in goodness-of-fit imposed on data by a hypothesis $(H, \beta)$ is identical for all $\beta$ (up to
their sign), the present result shows that in the case of simple error functions, a reverse
property also holds: the ordering in goodness-of-fit imposed on hypotheses $(H, \beta)$ by
given data $D$ is identical for all $\beta$ (up to their sign).

Proposition  6.7 

Let ER be a simple error function for $\mathcal{M}$ and let $D = (x^n, y^n)$. For each $\beta_1, \beta_2 \in \Gamma_{nat}$
with $\beta_1, \beta_2 > 0$ or $\beta_1, \beta_2 < 0$ and each $H_1, H_2 \in \mathcal{M}$ we have

$$P(y^n|H_1,\beta_1,x^n) > P(y^n|H_2,\beta_1,x^n) \iff P(y^n|H_1,\beta_2,x^n) > P(y^n|H_2,\beta_2,x^n)$$

and, in particular, if $\beta_1, \beta_2 > 0$ and if there exists a unique $H$ that minimizes the
empirical error $\mathrm{ER}(y^n|H, x^n)$, then

$$\hat{H}(D|\beta_1) = \hat{H}(D|\beta_2) = \arg\min_{H \in \mathcal{M}} \{\mathrm{ER}(y^n|H, x^n)\}. \tag{6.34}$$

Hence the $\hat{H}$ in the tuple $(\hat{H}, \hat\beta)$ that maximizes the likelihood is independent of $\beta$
(except for the sign of $\beta$) and (if $\beta > 0$) is equal to the $H \in \mathcal{M}$ that minimizes the
empirical error.

Proof proposition 6.7: Immediate from instantiating $P(y^n|H_i, \beta_i, x^n)$ using definition
6.5 and the fact that $Z(\beta)$ does not depend on $H$. □
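
The ordering property of proposition 6.7 can be checked on a toy example. The sketch below is only an illustration under our own assumptions: it uses the 0/1-error with $Z(\beta) = 1 + e^{-\beta}$, a made-up data set of eight points, and four hypothetical threshold classifiers.

```python
import math

def zero_one_error(hypothesis, xs, ys):
    """Empirical 0/1-error: the number of wrong predictions."""
    return sum(1 for x, y in zip(xs, ys) if hypothesis(x) != y)

def log_likelihood(hypothesis, beta, xs, ys):
    """ln P(y^n | H, beta, x^n) = -beta * ER(y^n | H, x^n) - n * ln Z(beta)."""
    log_z = math.log(1.0 + math.exp(-beta))  # Z(beta) for the 0/1-error, the same for every H
    return -beta * zero_one_error(hypothesis, xs, ys) - len(ys) * log_z

def ranking(hypotheses, beta, xs, ys):
    """Hypothesis names ordered from highest to lowest likelihood."""
    return sorted(
        hypotheses,
        key=lambda name: log_likelihood(hypotheses[name], beta, xs, ys),
        reverse=True,
    )

# Made-up data and hypotheses, for illustration only.
xs = list(range(8))
ys = [0, 0, 0, 1, 1, 1, 1, 1]
hypotheses = {
    "threshold_3": lambda x: int(x >= 3),  # 0 errors
    "threshold_5": lambda x: int(x >= 5),  # 2 errors
    "always_1": lambda x: 1,               # 3 errors
    "always_0": lambda x: 0,               # 5 errors
}
```

Calling `ranking(hypotheses, beta, xs, ys)` with any two positive values of `beta` yields the same order, led by the hypothesis with the smallest empirical error; a negative `beta` reverses the order.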

At  this  point,  we  have  established  all  basic  properties  of  entropification.  We  will  use 
these  properties  to  prove  the  main  results,  concerning  its  behaviour  in  the  i.i.d.  case. 
Before we do this, we give a summary of the possible interpretations of $\beta$. We can
estimate $\beta$ by any statistical means. The parameter $\beta$ as part of the hypotheses $(H, \beta)$
can be interpreted in the following ways:

1. $\beta$ determines the expected error $E_{(H,\beta)}(\mathrm{ER}(Y|H, X))$, hence ...

2. ... when $(H, \beta)$ is inferred from data $D$, $\beta$ serves as an estimate of the error $H$
will make on future data,

3. $\beta$ determines the entropy $\mathcal{H}(H, \beta)$ (proposition 6.4): the closer $|\beta|$ to 0, the
larger $\mathcal{H}(H, \beta)$,

4. $\beta$ determines how strongly the error $\mathrm{ER}(y^n|H, x^n)$ is weighted in the code
based on $P(\cdot|H, \beta, \cdot)$ which has lengths $L(y^n|H, x^n) = \beta\,\mathrm{ER}(y^n|H, x^n) +
n \ln(Z_H(\beta))$ (equation (6.9)): the closer $|\beta|$ to 0, the closer $P$ is to the uniform
distribution.
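
Items 3 and 4 can be made concrete for the 0/1-error. The following sketch is an illustration under the assumption that the entropified model assigns probability $e^{-\beta}/(1 + e^{-\beta})$ to the label that $H$ does not predict; the function names are ours.

```python
import math

def error_probability(beta):
    """Probability that P(.|H, beta, x) assigns to the label H does not predict (0/1-error)."""
    return math.exp(-beta) / (1.0 + math.exp(-beta))

def entropy(beta):
    """Entropy in nats of P(.|H, beta, x); it does not depend on H or x."""
    p = error_probability(beta)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def code_length(beta, num_errors, n):
    """L(y^n|H, x^n) = beta * ER(y^n|H, x^n) + n * ln Z(beta), in nats."""
    return beta * num_errors + n * math.log(1.0 + math.exp(-beta))
```

At `beta = 0` the entropy equals $\ln 2$, its maximum, and the code length is $n \ln 2$ regardless of the number of errors, which is the uniform code. As $|\beta|$ grows the entropy shrinks and the error term carries more weight.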

The last two items show that $\beta$ can be interpreted as a kind of ‘noise’ level, measuring
for each fixed $H$ the apparent randomness of the data with respect to the hypothesis $H$.
We use the word ‘apparent’ because a value of $\beta$ close to 0 does not mean that the data are
random in any general sense; it only means that $H$ does not give very much information
about the data.

The idea to turn a function, in our case an error function $\mathrm{ER}(\cdot|\cdot)$, into a class of
probability distributions $P(\cdot|\cdot,\beta) = Z(\beta)^{-1}\exp(-\beta\,\mathrm{ER}(\cdot|\cdot))$ is actually not new: it is
common practice in statistical mechanics [54] [86] [112] where $P$ is the probability density
function, $Z$ the partition function, the error function is replaced by the ‘energy
function’ or Hamiltonian, and instead of a parameter $\beta$ one uses a nonnegative parameter
$T$ (called a ‘temperature’) satisfying $\beta = 1/kT$ where $k$ is the Boltzmann constant.
Such ‘energy functions’ and ‘temperatures’ are frequently used outside a purely
physical context. As far as we know, the role of the temperature $T$ is somewhat different
from our $\beta$ since $T$ is not treated as a parameter to be estimated from the data.

We now present our main results concerning entropification. We study the behavior of
entropified model classes when data are independently distributed according to some
unknown ‘true’ distribution $P^*$. Roughly, it is shown that with an entropified model
class $(\mathcal{M})_{\mathrm{ER}}$, if given enough data, we can find the model in $(\mathcal{M})_{\mathrm{ER}}$ with the smallest
expected prediction error under $P^*$. Additionally, this model will provide a correct
estimate of the average prediction error over future data that it will achieve; hence the
model gives a good impression of ‘how good it really is’ when errors are measured by
ER. The important thing is that this model, which is both optimal and ‘reliable’, will be
found even if $P^*$ is not contained in $(\mathcal{M})_{\mathrm{ER}}$. Below we state a technical lemma that is
needed in our theorems. The theorems concern the general case (theorem 6.1), the case
of simple error functions (theorem 6.2), the case of the logarithmic error (theorem 6.3)
and the case of the squared error (theorem 6.4). The lemma and the theorems assume
that the data are i.i.d. according to some unknown but fixed probability distribution
$P^*$. We have to impose some mild conditions on $P^*$. These amount to the existence of
some ‘window’ (i.e. a bounded set containing more than one element) within which all
data will fall. The reason is that otherwise the required expectation $E_{P^*}(\mathrm{ER}(Y|H, X))$
may not exist. Here is a formal definition of this condition:

Definition  6.8  (Regularity  condition  for  the  true  distribution) 

Let a sample space $E = E_1 \times \cdots \times E_m$ be given for $m \ge 1$. Whenever in the following
we speak of a ‘true’, or ‘generating’ distribution $P^*$, we assume $P^*$ to be a distribution
over $E_{P^*} = E_{P^*,1} \times \cdots \times E_{P^*,m}$ with full support such that for $1 \le i \le m$, (a)
$E_{P^*,i} \subseteq E_i$, (b) $E_{P^*,i}$ contains more than one element and (c) if $E_i$ is continuous, then
$E_{P^*,i}$ is compact.


We  continue  by  stating  our  technical  lemma.  Essentially,  it  says  the  following:  if  M  is 
compactly  parameterized,  then  the  average  code  length  of  xn  based  on  the  maximum 
likelihood  model  for  xn  converges  (with  probability  1)  to  the  expected  code  length 
based  on  the  model  in  the  class  that  minimizes  this  expected  code  length.  This  holds 
if  the  data  are  i.i.d.  according  to  some  P*  satisfying  definition  6.8.  Note  that  P*  is  not 
required  to  be  a  member  of  M. 

Lemma  6.2 

Let $\mathcal{M} = \{P(\cdot|\theta) \mid \theta \in \Gamma\}$ be a class of i.i.d. probabilistic models over sample space
$E$ that is finitely parameterized by $\Gamma \subset \mathbb{R}^k$ where $\Gamma$ is compact. Let the data be i.i.d.
according to some $P^*$ satisfying definition 6.8. Then the following minima exist for
all $n$, all $x^n \in E^n$:

$$L(x^n) = \min_{\theta \in \Gamma} \{-\ln(P(x^n|\theta))\}, \tag{6.35}$$

$$L(P^*) = \min_{\theta \in \Gamma} \{E_{P^*}(-\ln(P(X|\theta)))\}. \tag{6.36}$$

We have with $P^*$-probability 1:

$$\lim_{n \to \infty} \frac{1}{n} L(x^n) = L(P^*). \tag{6.37}$$
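
To see what lemma 6.2 asserts, consider the following small simulation. It is a sketch under our own assumptions, not part of the proof: the model class consists of Bernoulli distributions with parameter restricted to the compact interval [0.2, 0.8], and the generating distribution is Bernoulli(0.9), which is deliberately not a member of the class. The seed and sample size are arbitrary.

```python
import math
import random

THETA_MIN, THETA_MAX = 0.2, 0.8   # compact parameter set Gamma = [0.2, 0.8]
TRUE_RATE = 0.9                   # P* is Bernoulli(0.9), which lies outside the model class

def expected_code_length(theta, rate=TRUE_RATE):
    """E_{P*}(-ln P(X|theta)) for a Bernoulli(theta) model."""
    return -(rate * math.log(theta) + (1.0 - rate) * math.log(1.0 - theta))

def average_ml_code_length(sample):
    """(1/n) L(x^n): average code length under the maximum likelihood model in Gamma."""
    frequency = sum(sample) / len(sample)
    # The log-likelihood is concave in theta, so the constrained maximum is the clipped frequency.
    theta_hat = min(max(frequency, THETA_MIN), THETA_MAX)
    return -(frequency * math.log(theta_hat) + (1.0 - frequency) * math.log(1.0 - theta_hat))

rng = random.Random(0)
sample = [1 if rng.random() < TRUE_RATE else 0 for _ in range(100_000)]
limit = expected_code_length(THETA_MAX)  # L(P*): the expectation is minimized at theta = 0.8
```

For this sample, `average_ml_code_length(sample)` lies close to `limit`, in line with (6.37), even though no model in the class equals the generating distribution.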

Proof lemma 6.2: In the proof we assume that $E$ is continuous. Adaptation to the case
of $E_{P^*} = E_{P^*,1} \times \cdots \times E_{P^*,m}$ where some of the $E_{P^*,i}$ are discrete is completely
straightforward. By compactness of $E_{P^*}$ and $\Gamma$ and the fact that we only consider $P^*$
with associated density functions, the minima (6.35) and (6.36) evidently exist. We can
cover $\Gamma$ with a grid of $k$-dimensional rectangles with side width $s$. The set $\Gamma$ is thus
partitioned into a finite number, say $M$, of rectangles $R_i$. Let, for $1 \le i \le M$, $\theta^i$ be
the model in $\Gamma$ corresponding to the center of $R_i$. In this way we obtain a reduced
parameter set $\Gamma_s = \{\theta^1, \ldots, \theta^M\}$. We first consider the simple case where $\mathcal{M}$ is such
that the following four minima are all attained by a unique value for each $n$, $x^n \in E^n$:

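$$\tilde\theta = \arg\min_{\theta \in \Gamma} \{E_{P^*}(-\ln(P(X|\theta)))\},$$

$$\tilde\theta_s = \arg\min_{\theta_s \in \Gamma_s} \{E_{P^*}(-\ln(P(X|\theta_s)))\},$$

$$\hat\theta(x^n) = \arg\min_{\theta \in \Gamma} \{-\ln(P(x^n|\theta))\},$$

$$\hat\theta_s(x^n) = \arg\min_{\theta_s \in \Gamma_s} \{-\ln(P(x^n|\theta_s))\}. \tag{6.38}$$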

We  now  show  in  two  stages  that  (6.37)  holds  in  the  case  where  these  single  minimizing 
values  exist. 

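Stage 1: Let $n$ and $\epsilon > 0$ be given. We claim that if we pick the rectangle side width $s$
small enough both of the following equations will hold:

$$|E_{P^*}(-\ln(P(X|\tilde\theta_s))) - E_{P^*}(-\ln(P(X|\tilde\theta)))| < \tfrac{1}{3}\epsilon, \tag{6.39}$$

$$\left|-\tfrac{1}{n}\ln(P(x^n|\hat\theta_s(x^n))) + \tfrac{1}{n}\ln(P(x^n|\hat\theta(x^n)))\right| < \tfrac{1}{3}\epsilon, \quad \forall x^n \in E_{P^*}^n. \tag{6.40}$$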

We first show (6.40). Let

$$f_n(\theta, x^n, \theta_0) = -\frac{1}{n}\ln(P(x^n|\theta)) + \frac{1}{n}\ln(P(x^n|\theta_0)) = \frac{1}{n}\sum_{i=1}^{n}\left[-\ln(P(x_i|\theta)) + \ln(P(x_i|\theta_0))\right]. \tag{6.41}$$

$\mathcal{M}$ is finitely parameterized by $\Gamma$. Checking definition 6.3, we see that for all $x \in
E_{P^*}$, $f_1(\theta, x, \theta_0)$ regarded as a function of $\theta$ must be continuous at all $\theta \in \Gamma$. By
compactness of $E_{P^*}$ it is easy to show that

$$f_{\max}(\theta_0, \theta) = \max_{x \in E_{P^*}} f_1(\theta, x, \theta_0), \text{ and}$$

$$f_{\min}(\theta_0, \theta) = \min_{x \in E_{P^*}} f_1(\theta, x, \theta_0),$$

are well-defined continuous functions of $\theta$ with $f_{\max}(\theta_0, \theta_0) = f_{\min}(\theta_0, \theta_0) = 0$. By
compactness of $\Gamma$, it is clear that the following function is well-defined for all $\delta \ge 0$

$$g_{\max}(\delta) = \max_{U(\delta)} \{|f_{\max}(\theta_0, \theta)|\}, \tag{6.42}$$

where the maximum is taken over the set $U(\delta) = \{\theta, \theta_0 \in \Gamma : |\theta - \theta_0| \le \delta\}$. Moreover
(compactness of $\Gamma$) one can show that $\lim_{\delta \to 0} g_{\max}(\delta) = g_{\max}(0) = 0$. The same holds
for $g_{\min}(\delta)$ which we define analogously to $g_{\max}$. These properties of $g_{\max}$ and $g_{\min}$
show the following: for every $\epsilon > 0$, we can pick the rectangle width $s$ small enough
such that the following implication holds for all $\theta, \theta_0 \in \Gamma$ and all $x^n \in E_{P^*}^n$:

if $\theta$ and $\theta_0$ both fall in the same rectangle $R_i$ then

$$f_{\max}(\theta_0, \theta) \le \epsilon/3 \text{ and } f_{\min}(\theta_0, \theta) \ge -\epsilon/3. \tag{6.43}$$

It can be seen from (6.41) that for $n \ge 1$, for all $x^n \in E_{P^*}^n$

$$f_{\min}(\theta_0, \theta) \le f_n(\theta, x^n, \theta_0) \le f_{\max}(\theta_0, \theta). \tag{6.44}$$

(6.40) now follows by combining (6.43) and (6.44) and substituting $\hat\theta(x^n)$ for $\theta_0$. A
similar but simpler argument shows that (6.39) holds. We omit the arguments.

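Stage 2: By the strong law of large numbers [30], we have with $P^*$-probability 1
that for all $\theta_s \in \Gamma_s$, for all $\delta > 0$, there exists an $n_0$ such that if $n > n_0$ then
$|-n^{-1}\ln(P(x^n|\theta_s)) - E_{P^*}(-\ln(P(X|\theta_s)))| < \delta$. Since $\Gamma_s$ contains only a finite
number of elements, this implies that, for all $\epsilon > 0$, with $P^*$-probability 1 there exists
an $n_0$ such that for all $\theta_s \in \Gamma_s$

$$n > n_0 \Rightarrow \left|-\tfrac{1}{n}\ln(P(x^n|\theta_s)) - E_{P^*}(-\ln(P(X|\theta_s)))\right| < \tfrac{1}{3}\epsilon. \tag{6.45}$$

In addition, we also have for all $x^n \in E_{P^*}^n$

$$E_{P^*}(-\ln(P(X|\hat\theta_s(x^n)))) \ge E_{P^*}(-\ln(P(X|\tilde\theta_s))), \tag{6.46}$$

$$-\tfrac{1}{n}\ln(P(x^n|\hat\theta_s(x^n))) \le -\tfrac{1}{n}\ln(P(x^n|\tilde\theta_s)). \tag{6.47}$$

By first applying (6.45) with $\theta_s = \hat\theta_s(x^n)$ and then (6.46) we find

$$-\tfrac{1}{n}\ln(P(x^n|\hat\theta_s(x^n))) > E_{P^*}(-\ln(P(X|\tilde\theta_s))) - \tfrac{1}{3}\epsilon. \tag{6.48}$$

Using (6.47) and then applying (6.45) with $\theta_s = \tilde\theta_s$ we find

$$-\tfrac{1}{n}\ln(P(x^n|\hat\theta_s(x^n))) < E_{P^*}(-\ln(P(X|\tilde\theta_s))) + \tfrac{1}{3}\epsilon. \tag{6.49}$$

Combining (6.39), (6.40), (6.48) and (6.49) we find that for all $\epsilon > 0$, there exists an
$n_0$ such that with $P^*$-probability 1, for all $n > n_0$

$$\left|-\tfrac{1}{n}\ln(P(x^n|\hat\theta(x^n))) - E_{P^*}(-\ln(P(X|\tilde\theta)))\right| < \epsilon, \tag{6.50}$$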


which is equivalent to (6.37). This proves the lemma for the case that the four minima
in (6.38) are all attained by single values in the parameter space.

If this is not the case we proceed as follows: by compactness all minima exist; the
only problem is that they may be attained for several values. This can be handled by
defining $\tilde\Theta$ as the set of all $\theta$ minimizing $E_{P^*}(-\ln(P(X|\theta)))$ (analogously to the first
line of (6.38)) and defining $\tilde\Theta_s$, $\hat\Theta$ and $\hat\Theta_s$ similarly. By the same reasoning as in stage 1
of the proof, we can now prove the following existentially quantified version of (6.39)
and (6.40). Let $n$ and $\epsilon > 0$ be given. We claim that if we pick the rectangle side
width $s$ small enough then there exist $\tilde\theta_s \in \tilde\Theta_s$, $\tilde\theta \in \tilde\Theta$, $\hat\theta_s(x^n) \in \hat\Theta_s$ and $\hat\theta(x^n) \in \hat\Theta$
such that (6.39) and (6.40) hold. By the same reasoning as in stage 2 of the proof,
we can also prove a universally quantified version of (6.48) and (6.49): if $n$ is large
enough, then for all $\hat\theta_s(x^n) \in \hat\Theta_s$ equations (6.48) and (6.49) hold with $P^*$-probability
1. Combining these new versions of (6.39), (6.40), (6.48) and (6.49) we can proceed
as above to show that (6.50) holds. The lemma then follows. □

We  now  give  an  informal  overview  of  the  theorems  we  are  about  to  prove.  We  start  by 
defining  an  analogue  of  the  definition  of  ‘reliable’  (definition  6.2)  for  the  setting  where 
some  true  (i.i.d.)  distribution  is  assumed  to  exist. 

Definition  6.9 

Let the data be i.i.d. according to some distribution $P^*$. Let $P$ be some given
probabilistic model over $E$ and let $\psi : E \to \mathbb{R}$ be some given function. We call $P$ reliable
with respect to $\psi$ under $P^*$ if

$$E_P(\psi(X)) = E_{P^*}(\psi(X)).$$


A model $P$ that is reliable with respect to $\psi$ under $P^*$ is (with probability 1) guaranteed
to give a correct impression of the average of $\psi(x)$ for large $n$: by the law of large
numbers, $n^{-1}\sum_{i=1}^{n}\psi(x_i) \to E_{P^*}(\psi(X)) = E_P(\psi(X))$ as $n$ increases, with $P^*$-probability 1.

Let $(\mathcal{M})_{\mathrm{ER}}$ be a model class entropified with respect to an error function ER. Let
the data be i.i.d. according to some arbitrary $P^*$ (not necessarily in $(\mathcal{M})_{\mathrm{ER}}$). The
main point of theorem 6.1 is that for each $H \in \mathcal{M}$, there exists a unique $\beta_H$ such
that (1) $E_{(H,\beta_H)}(\mathrm{ER}(Y|H, X)) = E_{P^*}(\mathrm{ER}(Y|H, X))$ (hence $(H, \beta_H)$ is reliable with
respect to ER under $P^*$), and (2) $\hat\beta(D|H)$, the maximum likelihood estimator for
fixed $H$ (definition 6.6), converges with $P^*$-probability 1 to $\beta_H$. Hence for each $H$,
a reliable estimate of its performance can, with probability 1, be obtained. If the error
function is simple (definition 6.7), then in addition, the stronger theorem 6.2 applies.
Its essence is (roughly) that the maximum likelihood estimator $(\hat{H}, \hat\beta)$ converges
with $P^*$-probability 1 to the model $(\tilde{H}, \tilde\beta)$ where $\tilde{H}$ is the optimal model
in $\mathcal{M}$, minimizing the ‘true’ expected error $E_{P^*}(\mathrm{ER}(Y|H, X))$, and $\tilde\beta$ is such that
$E_{(\tilde{H},\tilde\beta)}(\mathrm{ER}(Y|\tilde{H}, X)) = E_{P^*}(\mathrm{ER}(Y|\tilde{H}, X))$, and so $(\tilde{H}, \tilde\beta)$ is reliable for $\mathrm{ER}(\cdot|\tilde{H}, \cdot)$
under $P^*$. Hence the optimal $\tilde{H}$ and a reliable estimate of its performance can both,
with probability 1, be obtained. If the error function is not simple, then things get more
complicated. Nevertheless, theorem 6.3 shows that in the special case of the (non-simple)
logarithmic error function, an analogue to the above (maximum likelihood
estimators converging to an optimal and reliable model) still holds. The squared error
function is simple but satisfies an additional interesting property as will be shown in
theorem 6.4.

In the theorems we make use of the maximum likelihood estimator $\hat\beta(D|H)$ for fixed
$H$ which is defined in definition 6.6. It is straightforward to show that, under our
conditions for the sample space of the generating distribution $P^*$, a unique value of $\hat\beta(D|H)$
always exists. Sometimes we will also make use of the full maximum likelihood
estimator $(\hat{H}, \hat\beta)(D)$. In all these cases, it is straightforward to show there exists at least
one maximum of the likelihood. We use the convention that, if several $(H, \beta)$ maxima
of the likelihood exist, then $(\hat{H}, \hat\beta)$ denotes the first one according to some
prespecified ordering over $(\mathcal{M})_{\mathrm{ER}}$.

Theorem  6.1 

Let $E = E_x \times E_y$. Let $\mathcal{M}$ be a class of models and let $\mathrm{ER} : E \times \mathcal{M} \to \mathbb{R}$ be an error
function for $\mathcal{M}$. Assume that the data $(x, y)$ are generated by independent sampling
from a distribution $P^*$ over $E_{P^*}$ as in definition 6.8. Then for all fixed $H \in \mathcal{M}$,
$E_{P^*}(\mathrm{ER}(Y|H, X))$ exists, and

1. there exists a unique $\beta_H$ depending on $H$ such that

$$E_{(H,\beta_H)}(\mathrm{ER}(Y|H, X)) = E_{P^*}(\mathrm{ER}(Y|H, X)), \tag{6.51}$$

and at the same time, for all $\beta \in \Gamma_{nat}(H)$ with $\beta \ne \beta_H$,

$$E_{P^*}(-\ln(P(Y|\beta, H, X))) > E_{P^*}(-\ln(P(Y|\beta_H, H, X))), \tag{6.52}$$

2. with $P^*$-probability 1

$$\lim_{n \to \infty} \hat\beta(x^n, y^n|H) = \beta_H, \tag{6.53}$$

and hence

$$\lim_{n \to \infty} E_{(H,\hat\beta(x^n, y^n|H))}(\mathrm{ER}(Y|H, X)) = E_{P^*}(\mathrm{ER}(Y|H, X)). \tag{6.54}$$

Proof theorem 6.1: We only prove the theorem for continuous $E_x$ and $E_y$. The case
where $E_x$ or $E_y$ or both are discrete is completely analogous. We first prove existence
of $E_{P^*}(\mathrm{ER}(Y|H, X))$. Definition 6.8 tells us that $P^*$ is defined over a compact
subspace of $E$. Conditions C1-C3 on ER make sure that $\mathrm{ER}(y|H, x)$ is continuous at all
$(x, y) \in E$. For continuous $E$, we only consider $P^*$ with associated continuous
density functions. Existence of $E_{P^*}(\mathrm{ER}(Y|H, X))$ now follows. Now to prove item 1,
note that by definition 6.8, $E_{P^*}(\mathrm{ER}(Y|H, X))$ must lie in the interior of $U_H$. Therefore
there must be a unique value $\beta_H$ for which (6.51) holds. We have, for each $\beta$,
$E_{P^*}(-\ln(P(Y|H, \beta, X))) = \beta E_{P^*}(\mathrm{ER}(Y|H, X)) + \ln(Z_H(\beta))$. By differentiating
with respect to $\beta$ one verifies that $E_{P^*}(-\ln(P(Y|H, \beta, X)))$ as a function of $\beta$ is
convex and reaches its unique minimum at the value $\beta_H$ for which (6.51) holds. This
proves (6.52). Concerning item 2, we will first prove (6.54). Since the data are i.i.d. we
can apply the strong law of large numbers [30] which gives that with $P^*$-probability
1, $n^{-1}\sum_{i=1}^{n}\mathrm{ER}(y_i|H, x_i)$ converges to $E_{P^*}(\mathrm{ER}(Y|H, X))$. (6.54) then follows by the
reliability of estimates of ER (proposition 6.2). Relation (6.53) is now immediate by
(6.54) and the fact (which we just showed) that $E_{P^*}(-\ln(P(Y|\beta, H, X)))$ as a function
of $\beta$ is convex and reaches its single minimum at $\beta = \beta_H$. □
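
For the 0/1-error, the quantities in theorem 6.1 have a closed form, which makes the convergence in (6.53) easy to observe. The sketch below rests on our own assumptions: the entropified model gives the wrong label probability $e^{-\beta}/(1 + e^{-\beta})$, the fixed hypothesis $H$ has a true error rate of 0.3, and errors are simulated directly as coin flips with an arbitrary seed.

```python
import math
import random

def beta_for_error_rate(rate):
    """beta solving exp(-beta) / (1 + exp(-beta)) = rate, for the 0/1-error."""
    return math.log((1.0 - rate) / rate)

def model_error_rate(beta):
    """Expected 0/1-error under the entropified model (H, beta)."""
    return math.exp(-beta) / (1.0 + math.exp(-beta))

true_rate = 0.3                      # assumed E_{P*}(ER(Y|H, X)) of some fixed H
beta_h = beta_for_error_rate(true_rate)

rng = random.Random(1)
n = 100_000
num_errors = sum(rng.random() < true_rate for _ in range(n))
# Maximizing -beta * num_errors - n * ln Z(beta) over beta gives the same formula
# evaluated at the empirical error rate.
beta_hat = beta_for_error_rate(num_errors / n)
```

Here `beta_h` plays the role of $\beta_H$ in (6.51), since `model_error_rate(beta_h)` reproduces the true error rate, and `beta_hat` is the maximum likelihood estimate, which for large `n` lies close to `beta_h`.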

For simple error functions the following applies: if $\beta > 0$ then minimization of the
logarithmic error $-\ln(P(y^n|H, \beta, x^n))$ corresponds to minimization of the error function
ER. This allows us to prove theorem 6.2, which says that the maximum likelihood
estimator $(\hat{H}, \hat\beta)$ for data $D$ converges to a model $(\tilde{H}, \tilde\beta)$ where, if $\tilde\beta > 0$, then $\tilde{H}$
minimizes the ‘true’ expected error $E_{P^*}(\mathrm{ER}(Y|H, X))$ over all $H \in \mathcal{M}$. Let us briefly
consider the case $\tilde\beta < 0$. In the case of the squared error, $\Gamma_{nat}(H)$ only contains positive
parameter values, so then always $\tilde\beta > 0$ and the problem does not occur. In the special
case of the 0/1-error, something interesting happens which we illustrate with an
example. Suppose our concept class $\mathcal{M}$ contains only two models $H_1$ and $H_2$. Suppose $P^*$
to be such that $E_{P^*}(\mathrm{ER}(Y|H_1, X)) = 0.3$ and $E_{P^*}(\mathrm{ER}(Y|H_2, X)) = 0.9$. Then the
hypothesis minimizing the expected 0/1-error is clearly $H_1$. However, $H_2$ can be
trivially modified into another ‘inverse’ hypothesis $\bar{H}_2$ with $E_{P^*}(\mathrm{ER}(Y|\bar{H}_2, X)) = 0.1$:
$\bar{H}_2(x)$ predicts 1 if $H_2(x) = 0$ and 0 otherwise. This trivial modification can be
achieved by entropification: the entropified model $(H, \beta)$ that leads to the shortest
expected code length will in our example be given for $H = H_2$ and $\beta < 0$; the fact
that $\beta < 0$ makes $H_2$ behave like its inverse $\bar{H}_2$. $H_2$ will lead to much shorter
(expected) code lengths than $H_1$ (all this can be easily checked using (6.11) and (6.12) of
example 6.2).
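
The numbers in this example can be worked out explicitly. The sketch below assumes the 0/1-error partition function $Z(\beta) = 1 + e^{-\beta}$, so that the expected code length per outcome is $\beta e + \ln(1 + e^{-\beta})$ for a hypothesis with expected error $e$; the function names are ours and the code lengths are in nats.

```python
import math

def best_beta(error_rate):
    """The beta that minimizes the expected code length for a given expected 0/1-error."""
    return math.log((1.0 - error_rate) / error_rate)

def expected_code_length(beta, error_rate):
    """Expected code length per outcome: beta * E_{P*}(ER) + ln Z(beta)."""
    return beta * error_rate + math.log(1.0 + math.exp(-beta))

beta_1, beta_2 = best_beta(0.3), best_beta(0.9)
length_1 = expected_code_length(beta_1, 0.3)               # H_1 with its optimal beta
length_2 = expected_code_length(beta_2, 0.9)               # H_2 with its optimal (negative) beta
length_inverse = expected_code_length(best_beta(0.1), 0.1)  # the inverse of H_2
```

Under these assumptions `beta_2` is negative, `length_2` coincides with `length_inverse` (about 0.325 nats), and both are well below `length_1` (about 0.611 nats).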

Theorem  6.2 

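Let $E$, data $(x, y)$, $P^*$ and $E_{P^*}$ be as in the statement of theorem 6.1. Let ER be a
simple error function and assume $\mathcal{M}$ to be such that $(\mathcal{M})_{\mathrm{ER}}$ is finitely parameterized
by $\Gamma_{\mathcal{M}} \times \Gamma_{nat}$ where $\Gamma_{\mathcal{M}}$ is compact. Then

1. The following minima exist

$$\mathrm{ER}(P^*) = \min_{H \in \mathcal{M}} E_{P^*}(\mathrm{ER}(Y|H, X)), \tag{6.55}$$

$$L(P^*) = \min_{P(\cdot|\theta,\cdot) \in (\mathcal{M})_{\mathrm{ER}}} E_{P^*}(-\ln(P(Y|\theta, X))). \tag{6.56}$$

Let $\tilde\theta$ be one of the models for which the minimum in (6.56) is obtained. Then

2. $\tilde\theta = (\tilde{H}, \tilde\beta)$ for some $\tilde\beta \in \Gamma_{nat}$. If $\tilde\beta > 0$ then $\tilde{H}$ is (one of) the hypothesis
(hypotheses) for which the minimum in (6.55) is obtained ($\tilde\beta$ is identical for
all such $\tilde{H}$).

Let $(\hat{H}, \hat\beta) = (\hat{H}, \hat\beta)(D)$ denote the maximum likelihood estimator in $(\mathcal{M})_{\mathrm{ER}}$.

3. We have

$$\lim_{n \to \infty} E_{(\hat{H},\hat\beta)}(\mathrm{ER}(Y|\hat{H}, X)) = E_{(\tilde{H},\tilde\beta)}(\mathrm{ER}(Y|\tilde{H}, X)) = E_{P^*}(\mathrm{ER}(Y|\tilde{H}, X)). \tag{6.57}$$

Hence, for each ‘true’, ‘generating’ distribution $P^*$ there exists an optimal model
$(\tilde{H}, \tilde\beta)$ such that the ‘true’ expectation under $P^*$ of the error $\mathrm{ER}(Y|\tilde{H}, X)$ is minimal
and equal to the expectation of this error under $(\tilde{H}, \tilde\beta)$: when given enough data, every
reasonable (definition 6.1) inference procedure will hit upon a model that is optimal in
this sense.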

Proof theorem 6.2: We only prove the theorem for continuous $E_x$ and $E_y$. The case
where $E_x$ or $E_y$ or both are discrete is completely analogous. Concerning items 1 and
2, existence of $\mathrm{ER}(P^*)$ is straightforward by compactness of $\Gamma_{\mathcal{M}}$. Existence of $L(P^*)$
and item 2 will now be proven at the same time. First write

$$E_{P^*}(-\ln(P(Y|\theta, X))) = \beta E_{P^*}(\mathrm{ER}(Y|H, X)) + \ln(Z(\beta)), \tag{6.58}$$

for $(H, \beta) = \theta$. Since we assume ER to be simple here, $Z(\beta)$ does not depend on
$H$. This shows that for each fixed $\beta > 0$, $E_{P^*}(-\ln(P(Y|(H, \beta), X)))$ reaches its
minimum for the set $\mathcal{H}^+$ of $H$ minimizing $E_{P^*}(\mathrm{ER}(Y|H, X))$. By differentiating with
respect to $\beta$ and the fact that $Z(\beta)$ does not depend on $H$, one finds that there exists
a single $\beta^+$ minimizing $E_{P^*}(-\ln(P(Y|(H, \beta), X)))$ for all $H \in \mathcal{H}^+$. For fixed
$\beta < 0$, $E_{P^*}(-\ln(P(Y|(H, \beta), X)))$ reaches its minimum for the set $\mathcal{H}^-$ of $H$
maximizing $E_{P^*}(\mathrm{ER}(Y|H, X))$ (which exists by compactness of $\Gamma_{\mathcal{M}}$). Now there exists a
single $\beta^-$ minimizing $E_{P^*}(-\ln(P(Y|(H, \beta), X)))$ for all $H \in \mathcal{H}^-$. If $\beta = 0$, then
$E_{P^*}(-\ln(P(Y|(H, \beta), X)))$ reaches its minimum for all $H \in \mathcal{M}$. From this it
easily follows that a $(\tilde\beta, \tilde{H})$ minimizing (6.58) exists, that all minima of (6.58) have the
same component $\tilde\beta$ and that if $\tilde\beta > 0$, then $\tilde{H} \in \mathcal{H}^+$. This proves both existence of
$L(P^*)$ and item 2 of the theorem. The key to the proof of item 3 is the result of (6.63)
below. In order to obtain this result we need to apply lemma 6.2. The lemma cannot
be simply applied to model classes $(\mathcal{M})_{\mathrm{ER}}$, since these contain conditional rather than
regular probabilistic models. To avoid this problem we change $(\mathcal{M})_{\mathrm{ER}}$ into a class of
essentially equivalent regular probabilistic models over $E_x \times E_y$ by extending it with
the uniform distribution over $E_x$ (this is possible since $E_x$ is compact by condition
C4). Let $P^u$ be the uniform distribution over $E_x$ and let

$$P^u(x^n, y^n|H, \beta) = P(y^n|H, \beta, x^n)\,P^u(x^n), \tag{6.59}$$

be the distribution that extends each conditional distribution $P(\cdot|H, \beta, \cdot)$ to a full
distribution over $E_x \times E_y$. We have for all $H \in \mathcal{M}$

$$-\ln(P^u(x^n, y^n|H, \tilde\beta)) = -\ln(P(y^n|H, \tilde\beta, x^n)) + C \cdot n, \tag{6.60}$$

for constant $C$. Here $\tilde\beta$ is as in the statement of the theorem, item 2. Let

$$L^u(P^*) = \min_{H \in \mathcal{M}} E_{P^*}(-\ln(P^u(X, Y|H, \tilde\beta))) = L(P^*) + C, \tag{6.61}$$

where $C$ is the same constant as in (6.60). Finally, let $(\mathcal{M})_{\tilde\beta} = \{P^u(\cdot, \cdot|H, \tilde\beta) \mid H \in \mathcal{M}\}$
be the class of probabilistic models of form (6.59) for which $\beta = \tilde\beta$. It is
straightforward to check that $(\mathcal{M})_{\tilde\beta}$ is such that lemma 6.2 applies. Substituting
$n^{-1}L(x^n) = -n^{-1}\ln(P^u(x^n, y^n|\hat{H}, \tilde\beta))$, this gives that with $P^*$-probability 1

$$\lim_{n \to \infty} -\frac{1}{n}\ln(P^u(x^n, y^n|\hat{H}, \tilde\beta)) = L^u(P^*). \tag{6.62}$$

Relations (6.60), (6.61) and (6.62) give us (with $P^*$-probability 1)

$$\lim_{n \to \infty} -\frac{1}{n}\ln(P(y^n|\hat{H}, \tilde\beta, x^n)) = L(P^*) = E_{P^*}(-\ln(P(Y|\tilde{H}, \tilde\beta, X))), \tag{6.63}$$

for all the $\tilde{H}$ minimizing (6.55), where the last equality follows from item 2 in the
statement of the theorem which we proved already. Exploiting the identity
$-\ln(P(y^n|H, \beta, x^n)) = \beta\,\mathrm{ER}(y^n|H, x^n) + n\ln(Z(\beta))$, with $P^*$-probability 1 we have

$$\lim_{n \to \infty} \frac{1}{n}\tilde\beta\,\mathrm{ER}(y^n|\hat{H}, x^n) + \ln(Z(\tilde\beta)) = \tilde\beta E_{P^*}(\mathrm{ER}(Y|\tilde{H}, X)) + \ln(Z(\tilde\beta)). \tag{6.64}$$

By reliability of estimates of ER (proposition 6.2) we have that $n^{-1}\mathrm{ER}(y^n|\hat{H}, x^n) =
E_{(\hat{H},\hat\beta)}(\mathrm{ER}(Y|\hat{H}, X))$. Plugging this into (6.64) proves (6.57). □



It is in general difficult to analyze for non-simple error functions whether an analogue
of theorem 6.2 holds. The proof of theorem 6.2 is based on the fact that, for simple
error functions, minimization of logarithmic error corresponds to minimization (or
maximization) of the error ER. For non-simple error functions this need not be the case
since $Z_H(\beta)$ varies with $H$. However, a special case occurs if $\mathcal{M}$ is probabilistic and
we entropify with respect to the logarithmic error. In that case, the function $\mathrm{ER} = \mathrm{ER}_{lg}$
itself measures the log-likelihood of the data, while the optimal model in $(\mathcal{M})_{\mathrm{ER}}$ is also
optimal with respect to expected log-likelihood. This allows an analogue to theorem
6.2 to be proven after all; it is embodied in theorem 6.3 below. In this situation, it
turns out to be somewhat harder to identify exact conditions under which the required
minima exist. Specifically, let $\mathcal{M}$ be a class of probabilistic models over sample space
$E$ that is finitely parameterized by $\Gamma_{\mathcal{M}}$ and let $(\mathcal{M})_{\mathrm{ER}_{lg}}$ be the entropification of $\mathcal{M}$
under the logarithmic error. $(\mathcal{M})_{\mathrm{ER}_{lg}}$ can be parameterized by $\Gamma_{\mathcal{M}} \times \Gamma_{nat}(H)$. Let
data $x_1, x_2, \ldots$ be generated by independent sampling from distribution $P^*$ over $E_{P^*}$,
where $P^*$ is as in definition 6.8. We assume that (1) $E_{P^*}$ is such that for all $n$, $x^n \in
E_{P^*}^n$, the maximum likelihood estimator of $x^n$ in $(\mathcal{M})_{\mathrm{ER}_{lg}}$, denoted by $\hat\theta = \hat\theta(x^n)$,
exists and falls within a compact subset of $\Gamma_{\mathcal{M}} \times \Gamma_{nat}(H)$, and (2)

$$L(P^*) = \min_{\theta \in \Gamma_{\mathcal{M}} \times \Gamma_{nat}(H)} E_{P^*}(-\ln(P(X|\theta))), \tag{6.65}$$

exists and is obtained by a single model $\tilde\theta$.

Theorem  6.3 

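Let $\mathcal{M}$, $P^*$ and $\tilde\theta$ be as above. Then it follows with $P^*$-probability 1

$$\lim_{n \to \infty} E_{\hat\theta(x^n)}(-\ln(P(X|\hat\theta(x^n)))) = E_{\tilde\theta}(-\ln(P(X|\tilde\theta))) = E_{P^*}(-\ln(P(X|\tilde\theta))). \tag{6.66}$$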


Proof theorem 6.3: In the proof we assume the notation of example 6.3. Specifically,
$P(\cdot|\eta)$ stands for the i.i.d. probabilistic model in $\mathcal{M}$ indexed by $\eta$, and $P(\cdot|(\eta, \beta)) =
Z^{-1}(\beta)\exp(\beta\ln(P(\cdot|\eta)))$ stands for the model in $(\mathcal{M})_{\mathrm{ER}_{lg}}$ indexed by $(\eta, \beta)$. Note
also that $\theta = (\eta, \beta)$. By reliability of the estimates of $\mathrm{ER}_{lg}$ when the class $(\mathcal{M})_{\mathrm{ER}_{lg}}$ is
used (proposition 6.2) we find

$$E_{\hat\theta(x^n)}(-\ln(P(X|\hat\eta))) = -\frac{1}{n}\sum_{i=1}^{n}\ln(P(x_i|\hat\eta)).$$

By straightforward calculation this gives

$$E_{\hat\theta(x^n)}(-\ln(P(X|\hat\theta(x^n)))) = -\frac{1}{n}\sum_{i=1}^{n}\ln(P(x_i|\hat\theta(x^n))). \tag{6.67}$$

Let $(\mathcal{M})'_{\mathrm{ER}_{lg}}$ be the restriction of $(\mathcal{M})_{\mathrm{ER}_{lg}}$ to models with parameter values in the
compact set within which $\hat\theta(x^n)$ must fall (we assume that such a set exists). Clearly,
$\tilde\theta$ must be a member of this set. Now we can apply lemma 6.2 to $(\mathcal{M})'_{\mathrm{ER}_{lg}}$. This gives,
with probability 1

$$\lim_{n \to \infty} -\frac{1}{n}\ln(P(x^n|\hat\theta(x^n))) = L(P^*). \tag{6.68}$$

Together, (6.67) and (6.68) show that (6.66) holds. □

In classical statistics, the problem of curve-fitting is cast in the following terms: one
assumes data to be independently generated by some unknown distribution $P^*$ and
one tries to identify the function $H^*$ (called the ‘regression function’) that, for each
$x$, gives the expected value (the mean) of $Y$ given that $X = x$ (some would prefer to
say: one assumes that data are generated by some function $H^*$ with errors distributed
according to $P^*$). Whatever the distribution of the errors, this can be achieved by using
the squared error function in the learning phase, as will be shown in theorem 6.4 below.
Such results have been known for a long time [13]. For completeness, and since it is
not difficult, we have included theorem 6.4 nevertheless.

Let $E = E_x \times E_y$ where $E_x \subset \mathbb{R}$ and $E_y = \mathbb{R}$. We assume data to be generated by
independent sampling from $P^*$ as in definition 6.8. Let $H^*(x) = E_{P^*}(Y|X = x)$.
That is, $H^*(x)$ gives the mean of $Y$ for each $x \in E_x$. We will assume that $H^*(x)$ is
continuous at all $x \in E_x$. Let $(\sigma^*)^2 = E_{P^*}((Y - H^*(X))^2)$. Hence $(\sigma^*)^2$ denotes
the ‘expected true variance’ of $Y$. We know that, for a given class $\mathcal{M}$ of functions
$E_x \to E_y$, $(\mathcal{M})_{\mathrm{ER}_{sq}}$ consists of conditional Gaussian distributions. These distributions
are obtained from the natural parameterization of $(\mathcal{M})_{\mathrm{ER}_{sq}}$ by substituting $\beta = 1/(2\sigma^2)$.
Under the natural parameterization, theorems 6.1 and 6.2 are applicable. The following
theorem extends these theorems for the specific case of $(\mathcal{M})_{\mathrm{ER}_{sq}}$. Briefly, the only
non-trivial results that are added are the following: (1) for every $P^*$, the optimal model
$P(\cdot|\tilde\sigma^2, \tilde{H}) = P(\cdot|\tilde\beta, \tilde{H})$ will be such that $\tilde{H}$ is the function in $\mathcal{M}$ that is closest (in the
mean squared error sense) to the ‘true’ function $H^*$ and (2) $\tilde\sigma^2$ can be interpreted as the
mean squared error of $\tilde{H}$. Since, for every $P^*$, $(\hat{H}, \hat\sigma^2)$ will converge to $(\tilde{H}, \tilde\sigma^2)$ with
$P^*$-probability 1, this implies that in the special case with $H^* \in \mathcal{M}$, $\hat{H}$ will converge
to the true $H^*$ and $\hat\sigma^2$ will converge to the true variance $(\sigma^*)^2$ with $P^*$-probability
1. This holds independently of whether $P^*(\cdot|x)$ is Gaussian or not. In the following
theorem, we assume models in $(\mathcal{M})_{\mathrm{ER}_{sq}}$ to be specified by $(H, \sigma^2)$ rather than $(H, \beta)$.
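
The correspondence between the natural parameter and the variance can be verified directly. The following sketch compares the entropified squared-error density, normalized by the Gaussian integral $\sqrt{\pi/\beta}$, with the usual Gaussian density at $\sigma^2 = 1/(2\beta)$. The function names and test points are our own illustrative choices.

```python
import math

def entropified_density(y, prediction, beta):
    """exp(-beta * (y - H(x))**2) / Z(beta) with Z(beta) = sqrt(pi / beta)."""
    return math.exp(-beta * (y - prediction) ** 2) / math.sqrt(math.pi / beta)

def gaussian_density(y, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def variance_from_beta(beta):
    """sigma^2 corresponding to the natural parameter beta, i.e. beta = 1 / (2 * sigma^2)."""
    return 1.0 / (2.0 * beta)
```

For any `beta > 0`, `entropified_density(y, h, beta)` and `gaussian_density(y, h, variance_from_beta(beta))` agree up to floating-point rounding, so a larger $\beta$ corresponds to a smaller variance.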

Theorem  6.4 

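Let $E$, $\mathcal{M}$, $(\mathcal{M})_{\mathrm{ER}_{sq}}$ and $P^*$ be as above. Then

1. For all $H \in \mathcal{M}$, $E_{P^*}(\mathrm{ER}_{sq}(Y|H, X))$ exists, and there exists a $\sigma_H^2$ depending
on $H$ such that

$$E_{(H,\sigma_H^2)}((Y - H(X))^2) = E_{P^*}((Y - H(X))^2) = (\sigma^*)^2 + E_{P^*}((H^*(X) - H(X))^2). \tag{6.69}$$

2. Further assume $\mathcal{M}$ to be such that $(\mathcal{M})_{\mathrm{ER}_{sq}}$ is finitely parameterized by $\Gamma_{\mathcal{M}} \times
\Gamma_{nat}$ where $\Gamma_{\mathcal{M}}$ is compact. Then the following minimum exists

$$\tilde\sigma^2 = \min_{H \in \mathcal{M}} E_{P^*}((Y - H(X))^2) = (\sigma^*)^2 + \min_{H \in \mathcal{M}} E_{P^*}((H^*(X) - H(X))^2). \tag{6.70}$$

Let $\tilde{H}$ be one of the models for which the minimum in (6.70) is obtained. We
have

$$\lim_{n \to \infty} E_{(\hat{H},\hat\sigma^2)}((Y - \hat{H}(X))^2) = E_{(\tilde{H},\tilde\sigma^2)}((Y - \tilde{H}(X))^2) = \tilde\sigma^2 = (\sigma^*)^2 + \min_{H \in \mathcal{M}} E_{P^*}((H^*(X) - H(X))^2). \tag{6.71}$$

In particular, if $H^* \in \mathcal{M}$, then $\hat\sigma^2$ converges with probability 1 to the true
variance $(\sigma^*)^2$ and $\hat{H}$ converges to the true hypothesis $H^*$.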

Proof of theorem 6.4: Most of item 1 of the theorem is straightforward from theorem 6.1;
the only thing that still needs to be proven is the fact that

$$E_{P^*}((Y - H(X))^2) = (\sigma^*)^2 + E_{P^*}((H^*(X) - H(X))^2). \tag{6.72}$$

Item 2 follows from theorem 6.2. The only thing that is not obvious from this theorem is,
once more, (6.72), and additionally

$$E_{(\tilde{H}, \tilde{\sigma}^2)}((Y - \tilde{H}(X))^2) = \tilde{\sigma}^2. \tag{6.73}$$

It is a standard fact of regression [3] (also straightforward to verify by calculation)
that for all $\sigma^2$ and $H$, we have $E_{(H, \sigma^2)}((Y - H(X))^2) = \sigma^2$. This shows (6.73).
Equation (6.72) is a variation of the well-known bias-variance decomposition [40],
also straightforward to prove:

$$E_{P^*}((Y - H(X))^2) - E_{P^*}((Y - H^*(X))^2) \overset{(1)}{=} E_{P^*(x)}\big(E_{P^*(y \mid x)}(2YH^*(X) - 2YH(X) + H(X)^2 - H^*(X)^2 \mid X = x)\big) \overset{(2)}{=} E_{P^*}((H(X) - H^*(X))^2). \tag{6.74}$$

(1) follows from using the linearity of expectation, working out the squares and
conditioning on X. (2) is obtained by using $E_{P^*}(Y \mid X = x) = H^*(x)$. Since
$E_{P^*}((Y - H^*(X))^2)$ is the expected true variance $(\sigma^*)^2$, the equality (6.72) is
immediate from (6.74). □
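
The decomposition (6.72) can also be checked numerically. The following is a minimal sketch, not part of the original derivation: the ‘true’ function `h_true`, the misspecified hypothesis `h_model`, the uniform input distribution and the uniform (non-Gaussian) noise are all illustrative assumptions.

```python
import random


def h_true(x):
    """Hypothetical 'true' regression function H* (an assumption)."""
    return x * x


def h_model(x):
    """Hypothetical misspecified hypothesis H (an assumption)."""
    return 0.5 * x + 0.1


def decomposition_check(n_samples=100_000, seed=0):
    """Monte Carlo estimate of both sides of (6.72).

    Returns (lhs, rhs), where lhs estimates E[(Y - H(X))^2] and rhs estimates
    (sigma*)^2 + E[(H*(X) - H(X))^2]. The noise is uniform, so P*(.|x) is
    deliberately not Gaussian.
    """
    rng = random.Random(seed)
    lhs = noise_var = approx_err = 0.0
    for _ in range(n_samples):
        x = rng.uniform(-1.0, 1.0)
        y = h_true(x) + rng.uniform(-1.0, 1.0)  # zero-mean noise, variance 1/3
        lhs += (y - h_model(x)) ** 2
        noise_var += (y - h_true(x)) ** 2
        approx_err += (h_true(x) - h_model(x)) ** 2
    return lhs / n_samples, (noise_var + approx_err) / n_samples


lhs, rhs = decomposition_check()
print(f"E[(Y-H)^2] ~ {lhs:.4f},  (sigma*)^2 + E[(H*-H)^2] ~ {rhs:.4f}")
```

The two estimates differ only by the sample average of the cross term $2(Y - H^*(X))(H^*(X) - H(X))$, which has expectation zero because $E_{P^*}(Y \mid X = x) = H^*(x)$.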


6.2  Entropification  and  MDL 

Is  entropification  merely  a  convenient  tool  to  make  predictions  reliable  or  are  there 
additional  reasons  as  to  why  we  should  ‘entropify’  our  model  classes?  In  this  section 
we  show  that  if  we  use  the  MDL  principle  as  our  statistical  inference  procedure,  then  it 
is  often  a  good  idea  to  use  an  entropified  model  class  for  at  least  two  different  reasons: 
first,  entropification  can  serve  to  optimize  the  trade-off  between  hypothesis  complexity 
and  goodness  of  fit  as  needed  in  the  two-part  MDL  code  and  the  stochastic  complexity. 
Second,  it  leads  to  codes  for  non-probabilistic  model  classes  that  can  be  justified  in 
terms  of  minimizing  expected  code  lengths. 

There  have  been  different  proposals  in  the  literature  on  how  to  deal  with  two-part 
codes  and  stochastic  complexity  codes  for  non-probabilistic  model  classes  in  the  MDL 
framework. For simplicity we will restrict our discussion to the two-part codes. Recall
that in the basic, probabilistic case, we select the H minimizing

$$-\log(P(x^n \mid H)) + L_{C_1}(H), \tag{6.75}$$

where $C_1$ is some code used for encoding the parameters indexing the hypothesis. In
[92] it is proposed to turn a non-probabilistic model class into a probabilistic class (which
essentially corresponds to entropification with $\beta = 1$). This leads to finding the H
minimizing

$$\beta\, ER(y^n \mid H, x^n) + n \ln(Z_H(\beta)) + L_{C_1}(H) \tag{6.76}$$

with $\beta = 1$. The problem here is that choosing $\beta = 1$ is essentially arbitrary but can
have large consequences: choosing a different value of $\beta$ we may end up, at least for
small n, with an H of different complexity (the closer $\beta$ is to 0, the larger the relative weight
of the complexity term in (6.76)).

In [7] it is proposed to select the hypothesis H minimizing the sum of the empirical
error $ER(y^n \mid H, x^n)$ and the square root of the complexity term times the sample size,
$\sqrt{n L_{C_1}(H)}$. While this criterion can be shown to have some strong asymptotic
properties, it is in a sense not faithful to the MDL principle since the resulting sum does
not have a natural interpretation as a code length.

In [125] it is proposed to minimize $\beta\, ER(y^n \mid H, x^n) + L_{C_1}(H)$ for some $\beta$ whose value
is made dependent on the size of the training set. Once more, this has strong asymptotic
properties, but again, it is not clear how to interpret the resulting sum from a purely
coding-theoretic point of view.

Instead, we propose to entropify $\mathcal{M}$ and then use (6.76) (now with an additional term
$L_{C_1}(\beta)$ added to account for the number of bits needed to encode $\beta$). We think there
are several advantages to using entropified model classes. We first note that, at least
non-asymptotically, using the entropified class $(\mathcal{M})_{ER}$ can lead one to choose different
models for the same data than when using a fixed $\beta$. We give an example.

Example  6.4 

Consider a class of continuous functions $\mathcal{M}$ entropified with the squared error. Let
data $D = (x^n, y^n)$ and model $H \in \mathcal{M}$ be given. Denote the average squared error H
makes on D by $\overline{ER}_{sq}$. Using the code (6.76) for fixed $\beta$, we obtain as total description
length of the $y^n$ given the $x^n$

$$L(y^n; H \mid x^n) = n\left(\beta\, \overline{ER}_{sq} + \tfrac{1}{2} \ln\left(\tfrac{\pi}{\beta}\right)\right) + L_{C_1}(H), \tag{6.77}$$

while using (6.76) for the entropified model $(H, \hat{\beta})$ where $\hat{\beta} = \hat{\beta}(D \mid H)$ is the parameter
that maximizes the likelihood of D given H, we obtain

$$L(y^n; H \mid x^n) = \tfrac{1}{2} n \left(1 + \ln(2\pi) + \ln(\overline{ER}_{sq})\right) + L_{C_1}(\hat{\beta}) + L_{C_1}(H), \tag{6.78}$$

which depends logarithmically rather than linearly on the average error (both equations
can be verified by substituting $\beta = 1/(2\sigma^2)$). When two-part code MDL is used, the MDL
optimal $\beta_{mdl}$ for given D and fixed H will not be equal to $\hat{\beta}$ but nevertheless it will
be reasonably close. $L_{C_1}(\beta)$ will be equal to $\tfrac{1}{2} \log(n) + c$ for some constant c. This
implies that there can very well be hypotheses $H_1$ and $H_2$ with different numbers
of parameters (so $L_{C_1}(H_1) \neq L_{C_1}(H_2)$) such that $H_1$ minimizes (6.77) while $H_2$
minimizes (6.78). In such a case, two-part code MDL based on the entropified model
class leads to a different optimal H. ◊
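
A small numerical illustration of this effect may help. The sketch below evaluates (6.77) with $\beta = 1$ and (6.78) for two hypothetical hypotheses; the error values, the parameter counts and the hypothesis code length of $(k/2)\ln n$ nats are assumptions chosen for illustration, and the constant c in the code length for $\beta$ is dropped.

```python
import math


def fixed_beta_length(n, avg_sq_err, model_cost, beta=1.0):
    """Total description length (6.77) in nats for a fixed beta."""
    return n * (beta * avg_sq_err + 0.5 * math.log(math.pi / beta)) + model_cost


def entropified_length(n, avg_sq_err, model_cost):
    """Total description length (6.78) in nats.

    The cost of encoding beta is taken as 0.5 * ln(n); the additive
    constant c is dropped (an assumption that does not affect the comparison).
    """
    beta_cost = 0.5 * math.log(n)
    fit = 0.5 * n * (1.0 + math.log(2.0 * math.pi) + math.log(avg_sq_err))
    return fit + beta_cost + model_cost


def param_cost(k, n):
    """Hypothetical hypothesis code length: (k / 2) * ln(n) nats."""
    return 0.5 * k * math.log(n)


n = 100
# name -> (number of parameters, average squared error); illustrative values
candidates = {"H1": (2, 0.02), "H2": (3, 0.01)}

best_fixed = min(
    candidates,
    key=lambda h: fixed_beta_length(n, candidates[h][1], param_cost(candidates[h][0], n)),
)
best_entropified = min(
    candidates,
    key=lambda h: entropified_length(n, candidates[h][1], param_cost(candidates[h][0], n)),
)
print("fixed beta = 1 selects:", best_fixed)
print("entropified class selects:", best_entropified)
```

With these numbers the fixed-$\beta$ criterion prefers the simpler hypothesis, because halving an already small error saves only about one nat, while the entropified criterion prefers the more complex one, because the logarithmic dependence on the error rewards the same halving with roughly $\tfrac{n}{2}\ln 2$ nats. The description lengths can be negative here because they are based on densities of continuous outcomes.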

Using (6.76) with entropified model classes $(\mathcal{M})_{ER}$ allows the sum (6.76) (now with
additional term $L_{C_1}(\beta)$) to be interpreted as a code length. By learning the optimal
value of $\beta$ from the data (which is what entropification in the two-part code setting
amounts to), we essentially choose the value that allows the shortest code length of the
data, which is in line with the general MDL philosophy. Moreover, since each model in
$(\mathcal{M})_{ER}$ corresponds to a code, we can also define stochastic complexity with respect
to such model classes in the usual way and use it as a basis of model class selection;
it allows us to compare different model classes based on different error functions for
the same data, since the performance of all the classes is measured using the same
criterion, namely, the code length. This is once again in line with the general MDL
philosophy of using code lengths as a ‘universal yardstick’ [92], to be employed whenever
different models or model classes are to be compared for the same data.

Another  thing  to  be  said  for  entropification  is  that  it  unifies  different  instantiations  of 
MDL.  In  the  existing  literature  on  MDL,  the  question  of  how  to  code  the  data  given 
a hypothesis has been given different answers depending on the category of model
class used. For probabilistic model classes, generally the Shannon-Fano code with
$L(D) = -\log(P(D))$ is used [8], [92]. For concept classes (classes consisting of
functions $E_x \to \{0, 1\}$), the usual approach [83] has been to explicitly code the
mistakes a hypothesis H makes on data D. For the case of non-probabilistic model classes
with arbitrary error functions ER, there have been several proposals, as we saw above.
Entropification (where data $D = (x^n, y^n)$ given hypotheses $(H, \beta)$ is encoded using
the code lengths $-\log(P(y^n \mid H, \beta, x^n))$) is an approach to handle non-probabilistic
model classes that contains the existing treatments of probabilistic and concept classes
as special cases. In the probabilistic case, as long as the model class is a full
exponential family, entropification will not change anything. In the case where $\mathcal{M}$ is
a concept class, the code based on entropification with respect to the 0/1-error, while
superficially different, is essentially equivalent to the traditional approach of coding
the mistakes H makes on D. This suggests (but of course does not prove) that
entropification can serve as the general ‘preprocessing’ tool to make a single version
of two-part code MDL applicable to essentially arbitrary model classes. We now show
the equivalence for concept classes formally in the example below.

Example  6.5  (Concept  learning  and  Bernoulli  parameters) 

Let $\mathcal{M}$ be a class of concepts over $E = E_x \times \{0, 1\}$ and let the observational data be
$D = (x^n, y^n)$. Two-part codes for concept classes are traditionally [83] based on the
following coding scheme: the $x_i$ are regarded as given. The $y_i$ are encoded by first
encoding a hypothesis $H \in \mathcal{M}$ and then encoding the exceptions to H, which are
all the indices i for which $y_i \neq H(x_i)$. We assume that hypotheses are encoded using
some fixed code $C_1 : \mathcal{M} \to B^*$. Clearly, given the $x_i$, H and the list of exceptions
$\mathbf{M} = \{i_1, \ldots, i_k\}$ we can fully reconstruct $y_1, \ldots, y_n$ (for $x_i$ with $i \notin \mathbf{M}$ we set $y_i =
H(x_i)$; for $x_i$ with $i \in \mathbf{M}$ we set $y_i = |1 - H(x_i)|$). If H makes k mistakes on a sample
D of length n, there are $\binom{n}{k}$ different exception sets $\mathbf{M} = \{i_1, \ldots, i_k\}$. Hence we need
$\ln\binom{n}{k} + L(k)$ nats to encode all these mistakes. Here $L(k) = O(\ln(k))$ equals the
number of nats needed to encode k using some prefix code for the numbers $0, \ldots, n$
(note that k has to be encoded to allow unique decoding). The total description length
of the $y_i$ given the $x_i$ becomes

$$L(y^n, H \mid x^n) = \ln\binom{n}{k} + L(k) + L_{C_1}(H). \tag{6.79}$$

Another way to arrive at a two-part code for the data would be to first entropify the
concept class $\mathcal{M}$ with respect to the 0/1-error function and then to encode data by first
coding some H, then some parameter $\beta$ (using a fixed code $C_1$) and then encoding the
$y_i$ using the code based on $P(\cdot \mid H, \beta)$. This would take

$$L(y^n, H, \beta \mid x^n) = -\ln(P(y^n \mid H, \beta, x^n)) + L_{C_1}(\beta) + L_{C_1}(H) \text{ nats.} \tag{6.80}$$

We proceed to show that (6.79) and (6.80) approximately coincide and hence that both
ways of coding the data are essentially equivalent. By substituting $\beta = \ln(1 - \theta) -
\ln(\theta)$, we find that the class $(\mathcal{M})_{ER_{01}}$ can be parameterized as follows: $(\mathcal{M})_{ER_{01}} =
\{P(\cdot \mid H, \theta, \cdot) \mid H \in \mathcal{M}, 0 < \theta < 1\}$, such that, if $ER_{01}(y^n \mid H, x^n) = k$, then

$$P(y^n \mid H, \theta, x^n) = \theta^k (1 - \theta)^{n-k}. \tag{6.81}$$

Instead of coding $\beta$ we can also code the $\theta$ corresponding to it. We can therefore rewrite
(6.80) as follows

$$L(y^n, H, \theta \mid x^n) = -\ln(P(y^n \mid H, \theta, x^n)) + L_{C_1}(\theta) + L_{C_1}(H) \text{ nats,} \tag{6.82}$$

where $P(y^n \mid H, \theta, x^n)$ is given by (6.81). The maximum likelihood estimator $\hat{\theta}$
maximizing (6.81) for fixed H and $(x^n, y^n)$ is given by $\hat{\theta} = k/n$. Based on H and $\hat{\theta}$, the
number of nats $-\ln(P(y^n \mid H, \hat{\theta}, x^n))$ needed to code the data becomes

$$-\ln(P(y^n \mid H, \hat{\theta}, x^n)) = -\ln\big(\hat{\theta}^k (1 - \hat{\theta})^{n-k}\big) \overset{(1)}{=} n\mathcal{H}(\hat{\theta}) \overset{(2)}{\approx} \ln\binom{n}{k}, \tag{6.83}$$

where $\mathcal{H}(\hat{\theta}) = -\hat{\theta}\ln(\hat{\theta}) - (1 - \hat{\theta})\ln(1 - \hat{\theta})$ is the binary entropy, (1) follows by
straightforward calculation and (2) by Stirling’s approximation
$\ln(n!) = n\ln(n) - n + \ln(\sqrt{2\pi n}) + O(\tfrac{1}{n})$. For precise bounds on $|n\mathcal{H}(\hat{\theta}) - \ln\binom{n}{k}|$ see
[20]. Since $\hat{\theta} = k/n$, and hence, when n is known (as we assume in this example), $\hat{\theta}$ can
be reconstructed from k only, we need approximately $L(k)$ nats to describe $\hat{\theta}$, where
$L(k)$ is defined as above. The total description length of the $y_i$ then becomes

$$L(y^n, H, \hat{\theta} \mid x^n) \approx \ln\binom{n}{k} + L(k) + L_{C_1}(H), \tag{6.84}$$

which is seen to coincide with (6.79). Hence if we code the data based on the
entropified model class $(\mathcal{M})_{ER_{01}}$ and use the optimal $\beta$ (corresponding to the optimal $\hat{\theta}$) for
given D and H, then the number of bits we need coincides with the number of bits
needed to efficiently encode the exceptions. ◊
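
The closeness of the two data-code lengths can be checked directly. The following sketch compares $\ln\binom{n}{k}$, the cost of naming the exception set, with $n\mathcal{H}(k/n)$, the cost under the entropified class at the maximum likelihood parameter; the function names and the sample sizes are illustrative choices, not taken from the text.

```python
import math


def binary_entropy_nats(theta):
    """Binary entropy in nats, with the convention 0 * ln(0) = 0."""
    if theta <= 0.0 or theta >= 1.0:
        return 0.0
    return -theta * math.log(theta) - (1.0 - theta) * math.log(1.0 - theta)


def exception_code_nats(n, k):
    """ln C(n, k): nats needed to single out one exception set of size k."""
    return math.log(math.comb(n, k))


def entropified_code_nats(n, k):
    """n * H(k / n): data code length under the ML Bernoulli parameter k / n."""
    return n * binary_entropy_nats(k / n)


for n, k in [(100, 5), (1000, 50), (1000, 500), (10000, 1)]:
    gap = entropified_code_nats(n, k) - exception_code_nats(n, k)
    print(f"n={n:5d} k={k:4d}  gap = {gap:.3f} nats  (ln(n+1) = {math.log(n + 1):.3f})")
```

The gap is non-negative and at most $\ln(n + 1)$, a standard bound on binomial coefficients, so it is of the same logarithmic order as the term $L(k)$ and negligible against the linear-in-n data cost.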

The  arguments  given  above  suggest  that  entropification  can  serve  as  a  general  means 
to  apply  two-part  code  MDL  to  non-probabilistic  model  classes.  Of  course,  they  do  not 
prove  that  entropification  will  be  as  well-behaved  as  either  approach  described  in  [7] 
or  [125]  to  this  problem. 

When a probabilistic model class is used in two-part code MDL, data is encoded by
first encoding some model $\theta$ and then coding the data based on the Shannon-Fano code
$L(x^n \mid \theta) = -\log(P(x^n \mid \theta))$. Why use this code and not any other one? There are
many other possibilities; to give an example, we could map each model $\theta$ to the code
with lengths

$$L'(x^n \mid \theta) = -\log\left(\frac{\sqrt{P(x^n \mid \theta)}}{\sum_{z^n \in E^n} \sqrt{P(z^n \mid \theta)}}\right),$$

which, by the Kraft inequality,

$$\sum_{x^n \in E^n} 2^{-L(x^n)} \leq 1, \tag{6.85}$$

also corresponds to a probability distribution over $E^n$. Why does the Shannon-Fano
code have a special status? The justification lies in the fact that, by using the
Shannon-Fano code, the code length of the data precisely reflects the probability: if $P(D_1 \mid \theta) =
a \cdot P(D_2 \mid \theta)$, then $L(D_1 \mid \theta) = L(D_2 \mid \theta) - \log(a)$.
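
To make the comparison concrete, the sketch below builds both codes for a small hypothetical model over four outcomes. The probabilities are assumptions chosen for illustration, and code lengths are measured in bits (logarithms to base 2), which is also an assumption about the convention in use.

```python
import math

# Hypothetical model over four outcomes (illustrative values only).
p = [0.5, 0.25, 0.125, 0.125]

# Shannon-Fano code lengths: L(x) = -log2 P(x).
shannon_fano = [-math.log2(q) for q in p]

# Alternative code: lengths derived from the normalized square roots of P.
norm = sum(math.sqrt(q) for q in p)
alternative = [-math.log2(math.sqrt(q) / norm) for q in p]


def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality for code lengths in bits."""
    return sum(2.0 ** -length for length in lengths)


def expected_length(lengths, probs):
    """Expected code length when outcomes are drawn according to probs."""
    return sum(w * length for w, length in zip(probs, lengths))


print("Kraft sums:", kraft_sum(shannon_fano), kraft_sum(alternative))
print("expected length under p:",
      expected_length(shannon_fano, p), expected_length(alternative, p))

# Outcome 0 is a = 2 times as probable as outcome 1, so its Shannon-Fano
# code word is log2(a) = 1 bit shorter.
a = p[0] / p[1]
print("L(x0) - L(x1) =", shannon_fano[0] - shannon_fano[1], " -log2(a) =", -math.log2(a))
```

Both codes satisfy the Kraft inequality with equality, so both are legitimate; what distinguishes the Shannon-Fano code is that differences in code length mirror probability ratios exactly, and that its expected length under p is the smaller of the two.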

Some authors prefer a different (or at any rate, additional) justification based on the
information inequality (6.24): if $\theta$ turns out to be the ‘true’ model, i.e. data is generated
by repeated sampling from $\theta$, then the expected code length $E_\theta(L(X^n))$ is minimized
if we set $L(x^n) = -\log(P(x^n \mid \theta))$. By using the Shannon-Fano code, we map each
model $\theta$ to the code that will be optimal if $\theta$ is actually true; hence it is the code that
best ‘suits’ $\theta$. This justification of the use of the Shannon-Fano code can be found in
[123]. We have always had some doubts about this argument, for two reasons: (a) it
does not say anything about the (realistic) case where the model class contains models
that allow us to compress the data (hence we can learn something about the data), yet
none of these models is close to the ‘true’ one generating the data; (b) it is not clear
how to extend this argument to non-probabilistic model classes.

Proposition  6.8  below  shows  how  entropification  allows  us  to  extend  the  Shannon- 
Fano  argument  to  a  more  general  case  which  includes  non-probabilistic  model  classes. 
Whereas  we  still  assume  the  existence  of  some  true  probability  distribution  generating 
the  data,  we  do  not  assume  any  more  that  it  is  contained  in  the  model  class  M  under 
consideration.  For  simplicity,  we  will  consider  only  the  unconditional,  ‘unsupervised’ 
case,  where  we  are  interested  in  coding  complete  outcomes  (and  not  just  y-values 
conditioned on x-values). Formally, we consider a class $\mathcal{M}$ of models and an error
function $ER : E \times \mathcal{M} \to \mathbb{R}$ (in case $\mathcal{M}$ is probabilistic we take ER to be the logarithmic
error). We let $\theta = (H, \beta)$ index a model in $(\mathcal{M})_{ER}$. Let $\mathcal{Q}$ be the class of probability
distributions $P^*$ over E satisfying

$$E_{(H, \beta)}(ER(X \mid H)) = E_{P^*}(ER(X \mid H)). \tag{6.86}$$

Let $\mathcal{C}$ be the class of all code length functions $L : E \to \mathbb{R} \cup \{\infty\}$ satisfying the Kraft
inequality (6.85). We are now ready to state our proposition. We discuss its
implications after the proof.

Proposition  6.8 

Let $\mathcal{M}$, ER, $\mathcal{Q}$, $\theta$ and $\mathcal{C}$ be as above. Let $L(\cdot \mid \theta) \in \mathcal{C}$ be the code length function of the
Shannon-Fano code for $\theta = (H, \beta)$, restricted to one outcome $x \in E$. That is

$$L(x \mid H, \beta) = -\log(P(x \mid H, \beta)) = \beta\, ER(x \mid H) + \ln(Z_H(\beta)).$$

We have:

1.
$$L(\cdot \mid \theta) = \arg\inf_{L' \in \mathcal{C}}\ \sup_{P^* \in \mathcal{Q}} E_{P^*}(L'(X)). \tag{6.87}$$

That is, $L(\cdot \mid \theta)$ gives the shortest worst-case code lengths, the worst case being
taken over all distributions satisfying (6.86).

2. Let, for given H, $U_H$ be the smallest interval such that $\forall x : ER(x \mid H) \in U_H$.
For every $H \in \mathcal{M}$ and for every $P^* \in \mathcal{Q}$ for which $E_{P^*}(ER(X \mid H))$ lies in the
interior of $U_H$, there exists a $\beta$ such that (6.86) holds.

Proof of proposition 6.8: Define $t = E_{(H, \beta)}(ER(X \mid H)) = E_{P^*}(ER(X \mid H))$. As is clear
from the regularity conditions for error functions, the probability distribution $P(\cdot \mid H, \beta)$
is the maximum entropy distribution for the constraint $E(ER(X \mid H)) = t$. Using
(6.22), we have that $E_{P^*}(L(X \mid H, \beta)) = E_{(H, \beta)}(L(X \mid H, \beta)) = \mathcal{H}(H, \beta)$ for every
$P^* \in \mathcal{Q}$, where $\mathcal{H}(H, \beta)$ denotes the entropy of $P(\cdot \mid H, \beta)$. On the other hand, let
$L' \in \mathcal{C}$ be a code length function different from
$L(\cdot \mid H, \beta)$. By (6.22), there exists a $P^* \in \mathcal{Q}$ (namely, $P^* = P(\cdot \mid H, \beta)$) such that
$E_{P^*}(L'(X)) > \mathcal{H}(H, \beta)$. This proves (1). To prove (2), note that the class of
maximum entropy models for the function $ER(X \mid H)$ coincides with the class of models in
$(\mathcal{M})_{ER}$ restricted to fixed H. Let us denote this subclass by $\mathcal{M}_{me}$. By proposition 6.1,
for each t in the interior of $U_H$, $\mathcal{M}_{me}$ contains a model satisfying $E(ER(X \mid H)) = t$.
This proves (2). □

This proposition shows that the Shannon-Fano code for models $\theta$ in entropified model
classes (a) leads to codes that are worst-case optimal if the probability distribution
$\theta$ is ‘true’ only in the sense that its expectation of the error coincides with the true
expectation  of  error,  and  (b)  that  entropified  model  classes  always  (except  possibly  for 
P*  with  expected  errors  at  the  boundaries  of  the  error  space)  contain  a  model  that  is 
‘true’  in  this  weak  respect. 
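
The worst-case property can be illustrated on a tiny sample space. The sketch below is only an illustration under stated assumptions: the four error values, the value of $\beta$, the two-point distribution `p_star` and the competing uniform code are all hypothetical choices, and code lengths are in nats.

```python
import math

# Hypothetical error values ER(x|H) on a four-point sample space, and a fixed beta.
errors = [0.0, 1.0, 2.0, 3.0]
beta = 0.7

# Entropified model: P(x|H,beta) = exp(-beta * ER(x|H)) / Z_H(beta).
weights = [math.exp(-beta * e) for e in errors]
z = sum(weights)
p_beta = [w / z for w in weights]

# Its Shannon-Fano code: L(x|H,beta) = beta * ER(x|H) + ln Z_H(beta).
code = [beta * e + math.log(z) for e in errors]


def expect(dist, values):
    """Expectation of values under the distribution dist."""
    return sum(d * v for d, v in zip(dist, values))


t = expect(p_beta, errors)       # expected error under the model
entropy = expect(p_beta, code)   # entropy of P(.|H,beta) in nats

# A very different P* with the same expected error t: all mass on the
# two extreme outcomes. It satisfies the constraint (6.86).
p_star = [1.0 - t / 3.0, 0.0, 0.0, t / 3.0]

# A competing code: the uniform code, with ln(4) nats for every outcome.
uniform_code = [math.log(4.0)] * 4

print("E_P*[L]       =", expect(p_star, code))
print("entropy       =", entropy)
print("E_model[L_uni] =", expect(p_beta, uniform_code))
```

Because the code length is linear in the error, its expectation is the same for every distribution with expected error t, however unlike the model that distribution is. Any other code does strictly worse than the entropy on at least one such distribution, namely on $P(\cdot \mid H, \beta)$ itself.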

If  one  uses  a  non-probabilistic  model  class  M,  one  usually  does  not  have  a  clear  idea 
about  the  distribution  generating  the  data.  If  one  is  at  all  willing  to  assume  that  such  a 
distribution  nevertheless  exists,  then  it  seems  reasonable  to  make  as  few  assumptions 
as  possible  about  it.  This  directly  leads  to  our  worst-case  scenario,  which  really  says 
that  every  i.i.d.  distribution  is  a  possible  candidate  for  generating  the  data.  That  is 
why we regard this proposition as justifying the use of the Shannon-Fano code for the
entropified (probabilistic) version of $\mathcal{M}$. We hasten to add though that there do exist
codes  (based  on  non-i.i.d.  model  classes)  whose  expected  lengths  under  every  P*  are 
arbitrarily  close  to  that  of  the  Shannon-Fano  code.  An  example  is  the  code  based  on 
the  universal  computer  language. 

The proposition also has something to say about the case where $\mathcal{M}$ itself is
probabilistic and we entropify with respect to the logarithmic error $ER_{lg}$. If $\mathcal{M}$ is itself an
exponential family, this will not change $\mathcal{M}$ (example 6.3), and the proposition tells
us that the Shannon-Fano code for $\mathcal{M}$ is optimal not only in the case that the data is
generated by one of the models in $\mathcal{M}$, but also in the case it is generated by some i.i.d.
model not in $\mathcal{M}$. If $\mathcal{M}$ is not an exponential family, then the usual optimality of the
Shannon-Fano code holds for the models in $\mathcal{M}$, while ‘worst-case’ optimality holds, by
proposition 6.8, for the models in $(\mathcal{M})_{ER_{lg}}$. Whether one should entropify or not then
depends on whether one thinks that one of the models in the class will be very close
to being ‘truly a true model’: if one entropifies, one adds an extra dimension to the
parameter space. This can lead to logarithmically larger code lengths; if $\mathcal{M}$ contains
the true model, then it will even lead, with probability 1, to larger code lengths of the
data, when data is encoded using either the stochastic complexity or the two-part MDL
code. However, if the true model is not in $\mathcal{M}$, then using $(\mathcal{M})_{ER_{lg}}$ instead of $\mathcal{M}$ can
sometimes lead to a linear decrease in code lengths. We briefly show why.

To see that, if $\mathcal{M}$ contains the true model, one will need more bits to encode the data
based on $(\mathcal{M})_{ER_{lg}}$ rather than $\mathcal{M}$, we use a result described in [17]. In that paper it
is proved that an analogue of the asymptotic expansion of stochastic complexity exists
for the case where data is distributed according to one of the models in a (probabilistic)
model class $\mathcal{M}$. Let $\mathcal{M}$ be a probabilistic model class consisting of i.i.d. models. In
[17] it is shown that, if the data are generated by one of the models $\theta^*$ in $\mathcal{M}$, then
under some mild further conditions on $\mathcal{M}$

$$L_{sc}(x^n \mid \mathcal{M}) = -\log(P(x^n \mid \theta^*)) + \tfrac{k}{2} \log(n) + O(1), \tag{6.88}$$

with $\theta^*$-probability 1. Here $L_{sc}(x^n \mid \mathcal{M})$ is the stochastic complexity of $x^n$ with respect
to $\mathcal{M}$ and k is the number of parameters needed for parameterizing $\mathcal{M}$. The two-part
code length is within $O(1)$ of the stochastic complexity. Observe that if the true model
$\theta^*$ is in $\mathcal{M}$, then it is also in $(\mathcal{M})_{ER_{lg}}$. Therefore, by (6.88), using $(\mathcal{M})_{ER_{lg}}$, the number
of parameters k is increased by 1, which results in a logarithmic increase in code length
(with probability 1). If $\theta^*$ is not in $\mathcal{M}$ then, supposing $\mathcal{M}$ is finitely parameterized by
$\Gamma \subseteq \mathbb{R}^k$, the asymptotic expansion of both stochastic complexity and two-part code
gives

$$L_{sc}(x^n \mid \mathcal{M}) = -\log(P(x^n \mid \hat{\theta}(x^n))) + \tfrac{k}{2} \log(n) + O(1).$$

By applying lemma 6.2, one sees that with $\theta^*$-probability 1

$$\lim_{n \to \infty} \tfrac{1}{n} L_{sc}(x^n \mid \mathcal{M}) = \min_{\theta \in \Gamma} E_{\theta^*}(-\ln(P(X \mid \theta))).$$

This will hold for both the original model class $\mathcal{M}$ and its entropified version $(\mathcal{M})_{ER_{lg}}$
(in the latter case, the values of $\beta$ have to be restricted to a compact set). By proposition
6.8, there exists a $\theta^*$ such that

$$\min_{\theta \in (\Gamma)_{ER_{lg}}} E_{\theta^*}(-\ln(P(X \mid \theta))) < \min_{\theta \in \Gamma} E_{\theta^*}(-\ln(P(X \mid \theta))),$$

where we used $(\Gamma)_{ER_{lg}}$ to denote the parameterization of $(\mathcal{M})_{ER_{lg}}$. In such a case,
both the two-part and the stochastic complexity code based on $(\mathcal{M})_{ER_{lg}}$ will clearly
achieve more compression (by a linear amount) than the codes based on $\mathcal{M}$. Since
$\mathcal{M} \subseteq (\mathcal{M})_{ER_{lg}}$, the opposite event (the code based on $\mathcal{M}$ achieving a linear gain in
compression compared to the code based on $(\mathcal{M})_{ER_{lg}}$) has zero probability for any $\theta^*$.
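
A toy case shows where the linear gain comes from. In the sketch below, the model class is assumed to contain a single Bernoulli model with parameter 0.2 (so it is not an exponential family), while the data are assumed to come from a Bernoulli source with parameter 0.4. Entropifying with the logarithmic error gives $P(x \mid \beta) \propto P(x)^\beta$; all names and numbers are illustrative assumptions.

```python
import math

theta_model = 0.2  # the single model in the hypothetical class M
theta_true = 0.4   # the hypothetical data-generating source, not in M


def cross_entropy(p_true, q):
    """Expected code length per outcome (nats) when coding with q under p_true."""
    return -(p_true * math.log(q) + (1.0 - p_true) * math.log(1.0 - q))


def entropified_theta(beta):
    """P(X = 1 | beta) in the entropified class: proportional to theta_model ** beta."""
    a = theta_model ** beta
    b = (1.0 - theta_model) ** beta
    return a / (a + b)


# Closed-form beta for which the entropified model matches the source:
# (theta_model / (1 - theta_model)) ** beta = theta_true / (1 - theta_true).
beta_opt = (math.log(theta_true / (1.0 - theta_true))
            / math.log(theta_model / (1.0 - theta_model)))

per_outcome_original = cross_entropy(theta_true, theta_model)
per_outcome_entropified = cross_entropy(theta_true, entropified_theta(beta_opt))
gain_per_outcome = per_outcome_original - per_outcome_entropified

for n in (100, 1000, 10000):
    extra_parameter_cost = 0.5 * math.log(n)  # price of the added dimension
    print(f"n={n:6d}  gain ~ {n * gain_per_outcome:9.1f} nats, "
          f"extra cost ~ {extra_parameter_cost:.1f} nats")
```

The per-outcome gain equals the Kullback-Leibler divergence between the source and the original model, so the total gain grows linearly in n, while the price of the extra parameter grows only logarithmically.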


6.3  Discussion 

We  have  introduced  the  concept  of  ‘entropification’  and  shown  how  it  can  be  used  in 
the  context  of  estimating  prediction  error  and  in  the  context  of  MDL.  We  leave  detailed 
conclusions  to  the  epilogue,  where  we  discuss  how  the  results  obtained  in  this  chapter 
can be used to partially resolve the problematic issues concerning MDL.


7. Epilogue: using models in a careful way


In this discussion we study whether we have resolved the problematic issues
concerning MDL. Briefly, these are

1. Can we use models that are partially wrong to give reasonable predictions of
future data? If so, what can we do with them and what not?

2. What to do if we are asked to use our model to predict future data while
utilizing various loss functions?

3. When using a probabilistic model class, why should we use the Shannon-Fano
code $L(D \mid \theta) = -\log(P(D \mid \theta))$ to encode the data with the help of a model?
When using a non-probabilistic model class, why should we use the code for
which $L(D \mid H) = ER(D \mid H) + K$ for all D?

From  the  point  of  view  of  the  MDL  philosophy,  we  choose  a  model  class  M  because 
we  think  it  will  help  us  to  capture  some  of  the  regularity  inherent  in  the  data  -  but 
we  have  no  hope  that  it  will  capture  all  (except  if  M  corresponds  to  the  class  of  all 
computer  programs  written  in  some  language  -  but  then  the  inference  process  becomes 
non-computable). This is the situation we will usually be in: all our models will
always, to some extent, be wrong. Therefore, though the question came up in connection
with  the  MDL  philosophy,  it  should  be  relevant  not  only  to  MDL,  but  to  all  statistical 
inference  procedures. 

Once  we  accept  the  fact  that  our  model  H  for  data  D  will  always  be  partially  wrong, 
we  are  faced  with  the  question  of  what  can  be  reliably  inferred  from  such  a  model  and 
what  not.  We  can  always  change  our  model  classes  in  such  a  way  that  we  can  reliably 
estimate  the  prediction  error  over  future  data.  This  will  lead,  with  high  probability,  to 
accurate  estimates  of  error  over  future  data  even  if  the  data  are  independently  drawn 
according  to  a  distribution  that  is  completely  different  from  our  model.  Hence,  if  we 
are  willing  to  make  the  i.i.d.  assumption,  then  as  long  as  we  measure  the  error  our 
model  makes  when  predicting  future  data  using  the  same  error  function  as  the  one 
that  was  used  in  inferring  the  model  from  the  data,  we  can  use  the  model  H  reliably: 
even  though  it  is  partially  wrong,  it  will  give  a  correct  impression  of  how  accurate 
it  is  in  predicting  future  data.  Note  however,  that  we  used  the  i.i.d.  assumption  -  so 
we  still  have  to  assume  something  about  ‘the  truth  out  there’.  Hence,  we  have  not 
resolved in general whether using an overly simple (that is, partially wrong) model
can  lead  to  ‘disastrous’  results.  Prediction  errors  may  be  accurately  estimated  under  a 
much  wider  assumption  than  the  classical  assumption  that  one’s  model  class  contains 
the  true  model.  The  question  remains  of  whether  this  is  not  too  weak;  whether  we 
will  not  always  be  interested  in  estimating  more  aspects  about  the  data  than  just  their 
prediction  error.  The  example  below  shows  that  sometimes  ‘reliable’  predictions  are 
enough,  while  at  the  same  time  ‘unreliable’  predictions  can  lead  to  very  misleading 
results. 

Example  7.1  (Classification) 

Recall that in concept learning the model class consists of functions $H : E_x \to \{0, 1\}$.
Frequently the goal will be to use the concept H learned on the basis of data D to
classify new data: one is given a value $x \in E_x$ and one has to predict the corresponding
$y \in \{0, 1\}$. Suppose one uses a class of concepts $\mathcal{M}$ entropified with the 0/1-error.
Suppose further that, for given data D, the estimate inferred is $(H, \theta)$. This estimate
says that if the model H is used, then the probability of making a wrong prediction is
$\theta$. Let us assume that the data are i.i.d. according to a model $P^*$ so that if D is large
enough and one uses a reasonable estimation procedure, then $\theta \approx P^*(H(X) \neq Y)$.
This means that if we are only interested in classifying future data, the component $\theta$
of our model $(H, \theta)$ will give us a good idea of just how well we can do that. Hence
if we use our model only for classification, we have neither an overly optimistic nor
an overly pessimistic idea of how good our model is at this task. Now let us consider a
specific example where the model $(H, \theta)$ as issued by our estimation procedure on the
basis of a large data set $D = (x^n, y^n)$ has $\theta = 0.05$. This means that, if D is really
large, we can predict future data with 95% accuracy. However, it may be the case
that, for the $x_i$ where $H(x_i) = 1$, $y_i$ is always equal to 1 while for the cases where
$H(x_i) = 0$, $y_i \neq H(x_i)$ half the time. If $H(x_i) = 1$ in 90% of the cases, then we will
have $\theta \approx 0.05$ while, if a new value x is given such that $H(x) = 0$ and we use $H(x)$ for
prediction, we will only be right in about 50% of the cases. Hence our model is very
bad for new data with $H(x) = 0$, and if a loss function is used such that predicting
a ‘false zero’ leads to a much higher loss than predicting a ‘false one’, then our
model will really be quite worthless. Prediction utilizing such a loss function is not
reliable - and indeed, if we stick to ‘safe statistics’, we are not allowed to make such
a prediction. We used an extreme example, but similar examples, where there exists a
very simple rule that gives accurate predictions for a large subset of the $x_i$ while being
quite bad on the remaining $x_i$, do occur in practice. ◊
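
The arithmetic behind this example can be spelled out as follows. The conditional error rates are those given in the example; the asymmetric loss values (100 for a false zero, 1 for a false one) are hypothetical numbers introduced only to illustrate the point.

```python
from fractions import Fraction

# Quantities taken from the example.
p_h1 = Fraction(9, 10)          # probability that H(x) = 1
err_given_h1 = Fraction(0)      # H is never wrong when it outputs 1
err_given_h0 = Fraction(1, 2)   # H is wrong half the time when it outputs 0

# Overall probability of a wrong prediction: what the 0/1-error estimate reports.
overall_error = p_h1 * err_given_h1 + (1 - p_h1) * err_given_h0
accuracy_when_h0 = 1 - err_given_h0

# Hypothetical asymmetric loss (an assumption, not from the text):
# predicting 0 when y = 1 costs 100, predicting 1 when y = 0 costs 1.
loss_false_zero = 100
loss_false_one = 1
expected_loss = ((1 - p_h1) * err_given_h0 * loss_false_zero
                 + p_h1 * err_given_h1 * loss_false_one)

print("overall error probability:", overall_error)
print("accuracy when H(x) = 0:   ", accuracy_when_h0)
print("expected asymmetric loss: ", expected_loss)
```

The overall error probability of 1/20 is estimated reliably, but it reveals nothing about how the errors are distributed over the two predicted labels, which is exactly what the asymmetric loss depends on.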

If we use an entropified model class $(\mathcal{M})_{ER}$, and we want to use our estimate $(H, \beta)$
to make predictions or decisions using loss functions that cannot be written as a linear
combination of ER, we see that, at least if the sample space E is discrete, this will often
still work - but it will also often be unreliable. That is because the model class $(\mathcal{M})_{ER}$
restricted to models with fixed H is essentially a maximum entropy model class. In
this case, it tells us that for an exponentially large majority of those data sets to which
$(H, \beta)$ gives a good fit, the frequencies $(\gamma_1, \ldots, \gamma_k)$ will be approximately equal to the
probabilities $P(1 \mid H, \beta), \ldots, P(k \mid H, \beta)$. If future data indeed belongs to this majority,
then the average of every function (hence also every loss function) over future data
will be approximately equal to its expectation over $(H, \beta)$, and the predictions will
be accurate. Only in a few cases, where the frequencies and the probabilities do not
coincide, will the predictions not be accurate - nevertheless, as we saw in the example
above, these cases may certainly occur.

This  leaves  us  with  justifying  the  use  of  the  Shannon-Fano  code  and  how  to  asso¬ 
ciate  codes  with  non-probabilistic  models.  When  using  entropified  model  classes,  the 
Shannon-Fano  code  can  be  justified  in  terms  of  minimizing  the  worst-case  expected 
code  length  for  probabilistic  and  non-probabilistic  model  classes  alike.  This  makes 
‘entropification’  a  quite  general  means  of  turning  model  classes  into  codes.  As  such,  it 
is in line with the general MDL philosophy, in which all models are viewed as probabilistic,
or more properly, as codes. Let us consider a quote by Rissanen ([93], page 20):

‘... we then see that the unification obtained by interpreting all models
as probabilistic has given us an immutable yardstick, the code length,
which we never can reduce to zero by scaling or other devices. The
same cannot be said about the usually suggested prediction error measures,
which we easily can scale to any size, and which therefore will
never be able to serve as a universal yardstick for model selection.’

We  agree  that  it  is  desirable  and  probably  even  possible  to  compare  all  models  and 
model classes for given data D in terms of the code length they assign to D. However,
we also think that Rissanen’s view leaves open two questions: first, how to base predictions
and  decisions  on  a  ‘probabilistic’  model  -  since  the  main  interpretation  of  the  model 
is  a  code  rather  than  a  probability  distribution  according  to  which  data  are  distributed, 
it  is  not  a  priori  clear  how  this  should  be  done.  The  second  question  is  how  to  change 
model  classes  that  are  normally  viewed  as  being  ‘non-probabilistic’  into  associated 
probabilistic  model  classes  in  a  principled  way.  As  we  see  it,  the  concept  of  ‘reliable 
estimation’  is  a  step  towards  answering  the  first  question,  while  ‘entropification’  is  a 
step  towards  answering  the  second  question. 

Occam’s razor: ‘if presented with a choice between indifferent alternatives, then one ought to select the simplest one.’
Reliable  inferences  allow  one  to  make  good  predictions  and  decisions  regarding  the  data  under  a  much  wider  variety  of 
assumptions  than  unreliable  inferences  do.  It  will  allow  us  to  establish  in  what  way  we  can  and  in  what  way  we  cannot  use 
overly  simple  models.  In  general,  we  will  be  interested  in  what  can  be  reliably  predicted  -  and  what  not  -  from  a  model  that 
is  only  partially  correct.  We  describe  a  new  procedure  called  entropification.  With  an  entropified  model,  if  given  enough 
data,  we  can  find  the  model  with  the  smallest  expected  prediction  error.  This  model  will  provide  a  correct  estimate  of  the 
average  prediction  error  that  it  will  achieve;  hence  the  model  gives  a  good  impression  of  ‘how  good  it  really  is.’ 